Predictive analytics: Poor outcome - Hemorrhage

Thoughts

Questions

  • What deficit are we talking about? Is it covered by the mRS?
  • Should the pre-treatment mRS be in the model, or is it already covered by the other variables?

Backlog

  • Honest estimation
  • G-estimation (using only significant predictors to obtain accurate ORs)
  • Make sure that the number of outcome events (26) is sufficient for the number of predictors.
  • Bootstrap the whole process - how many models include each of the features?
  • The SuSiE method: https://www.biorxiv.org/content/10.1101/501114v1
  • Choose best model with forward/backward exclusion
  • Feature selection with honest estimation + p-value-based + bootstrap
  • Feature selection with honest estimation + LASSO-based + bootstrap
  • Implement PCA to create independent variables and not drop variables
  • Bayesian fitting
  • Change to use tidybayes and tidymodels
  • Use neural networks to fit
  • Consider a time-to-event analysis
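The bootstrap item above (how many resampled models retain each feature) can be prototyped before committing to a framework. A minimal base-R sketch on hypothetical simulated predictors x1-x3, not the study data; in the real pipeline the formula would be built from CHOSEN_OUTCOME and the exposure vectors, and the p-value rule could be swapped for LASSO via glmnet:

```r
# Sketch: bootstrap selection frequency with a simple p-value rule.
set.seed(1891)
n <- 200
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + 1.5 * x1))  # only x1 is truly predictive
dat <- data.frame(y, x1, x2, x3)

selection_counts <- c(x1 = 0, x2 = 0, x3 = 0)
B <- 100
for (b in seq_len(B)) {
  # Resample patients with replacement and refit the model
  boot <- dat[sample(nrow(dat), replace = TRUE), ]
  fit <- glm(y ~ x1 + x2 + x3, data = boot, family = binomial())
  # Tally which coefficients are "selected" (p < 0.05), dropping the intercept
  pvals <- summary(fit)$coefficients[-1, "Pr(>|z|)"]
  selection_counts <- selection_counts + (pvals < 0.05)
}
selection_counts / B  # proportion of bootstrap models retaining each feature
```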

Setup

Imports

# Load packages
library(glmnet)
library(magrittr)
library(selectiveInference)
library(tidyverse)

# Data
# https://office365stanford-my.sharepoint.com/:x:/r/personal/simonlev_stanford_edu/_layouts/15/Doc.aspx?sourcedoc=%7BCE7DAF19-8F1A-43B0-866D-3D0F36336EDC%7D&file=Peds%20AVM%20Database%20July%2010%202024%20Onwards.xlsx&fromShare=true&action=default&mobileredirect=true

# Source code

Configurations

# Source outcome-specific configs
source(params$CONFIG)

# Destination locations
DST_DIRNAME <- "../../outputs/"

# Source locations
SRC_DIRNAME <- "../../data/3_tidy/patients"
SRC_BASENAME <- "patients-daily_2024-07-21.csv"

# Should the Dockerfile be automatically updated?
UPDATE_DOCKERFILE <- FALSE

# Set the seed
set.seed(1891)

Parameters

EXPOSURES_CONTINUOUS <- c(
  "Age at 1st treatment (years)" = "age_at_first_treatment_yrs",
  "mRS (presentation)" = "modified_rankin_score_presentation",
  "mRS (pre-treatment)" = "modified_rankin_score_pretreatment",
  "mRS (1-week post-op)" = "modified_rankin_score_postop_within_1_week",
  "mRS (final)" = "modified_rankin_score_final",
  "Nidus size (cm)" = "max_size_cm",
  "Spetzler-Martin grade" = "spetzler_martin_grade",
  "Lawton-Young grade" = "lawton_young_grade",
  "Size score" = "size_score"
)

EXPOSURES_BINARY <- c(
  "Sex (Male)" = "is_male",
  "Involves eloquent location" = "is_eloquent_location",
  "Has deep venous drainage" = "has_deep_venous_drainage",
  "Diffuse nidus" = "is_diffuse_nidus",
  "Hemorrhage at presentation" = "has_hemorrhage",
  "Seizures at presentation" = "has_seizures",
  "Deficit at presentation" = "has_deficit",
  "Paresis at presentation" = "has_paresis",
  "Presence of aneurysms" = "has_associated_aneurysm",
  "Spetzler-Martin grade < 4" = "is_spetzler_martin_grade_less_than_4",
  "Had surgery" = "is_surgery"
)

EXPOSURES_CATEGORICAL <- c(
  "Location" = "location",
  "Venous drainage" = "venous_drainage",
  "Modality of treatment" = "procedure_combinations"
)

OUTCOMES <- c(
  "Poor outcome (mRS >= 3)" = "is_poor_outcome",
  "Obliteration" = "is_obliterated",
  "Complications - minor" = "has_complication_minor",
  "Complications - major" = "has_complication_major",
  "mRS change (final - pre-treatment)" =
    "modified_rankin_score_final_minus_pretx",
  "mRS change (final - presentation)" =
    "modified_rankin_score_final_minus_presentation",
  "mRS change direction (Final - Pre-tx)" = "change_in_mrs_tx_vs_final"
)

Outputs

Create lists.

plots <- list()
tables <- list()
models <- list()

Functions

R utils.

Data analysis utils.


Read

# File path
filepath <- file.path(SRC_DIRNAME, SRC_BASENAME)

# Read
patients_ <-
  read_csv(filepath) %>%
  dplyr::sample_frac(params$PROPORTION_OF_DATA)

# If analyzing a subset of the data, restrict the dataset
if (params$TITLE == "Poor outcome - Hemorrhage") {
  patients_ <- patients_ %>% filter(has_hemorrhage)
}
if (params$TITLE == "Poor outcome - No Hemorrhage") {
  patients_ <- patients_ %>% filter(!has_hemorrhage)
}

Conform

Setup

Remove all empty rows.

patients_ <-
  patients_ %>%
  filter(!is.na(patient_id)) %>%
  arrange(patient_id)

Create two dataframes: one for univariable and one for multivariable analysis. Variables for the multivariable analysis will be recoded to reduce the number of levels and avoid overfitting the model.

df_uni <-
  patients_ %>%
  filter(is_eligible) %>% 
  filter(!is.na(spetzler_martin_grade))

df_multi <-
  patients_ %>%
  filter(is_eligible) %>% 
  filter(!is.na(spetzler_martin_grade))
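The level-reduction rationale can be made concrete with the design matrix: every non-reference level of a factor adds a coefficient for the model to estimate. A small base-R illustration with a toy location factor (not the study data):

```r
# Each non-reference level of a factor becomes its own dummy column,
# so collapsing levels shrinks the number of coefficients to estimate.
loc_full <- factor(c("Frontal", "Parietal", "Cerebellum", "Corpus Callosum",
                     "Deep", "Frontal"))
loc_collapsed <- factor(ifelse(
  loc_full %in% c("Cerebellum", "Corpus Callosum"), "Other",
  as.character(loc_full)
))

ncol(model.matrix(~loc_full))       # intercept + 4 dummy columns
ncol(model.matrix(~loc_collapsed))  # intercept + 3 dummy columns
```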

Recode columns

For venous_drainage, recode "Both" as "Deep".

df_multi <-
  df_multi %>%
  mutate(
    venous_drainage = ifelse(venous_drainage == "Both", "Deep", venous_drainage)
  )

For procedure_combinations, recode codes with more than one character (i.e., more than one modality) as "Multimodal" to reduce the number of levels.

df_multi <-
  df_multi %>%
  mutate(procedure_combinations = case_when(
    nchar(procedure_combinations) > 1 ~ "Multimodal",
    .default = procedure_combinations
  ))

For procedure_combinations, add an indicator for whether the treatment included surgery.

df_multi <-
  df_multi %>%
  mutate(
    is_surgery = str_detect(procedure_combinations, "S")
  ) %>%
  relocate(is_surgery, .after = procedure_combinations)

df_uni <-
  df_uni %>%
  mutate(
    is_surgery = str_detect(procedure_combinations, "S")
  ) %>%
  relocate(is_surgery, .after = procedure_combinations)

For location, collapse rare levels into "Other".

df_multi <-
  df_multi %>%
  mutate(
    location = ifelse(location == "Corpus Callosum", "Other", location),
    location = ifelse(location == "Cerebellum", "Other", location),
    # location = ifelse(location == "Deep", "Other", location),
    location = factor(location),
    location = relevel(location, ref = "Other")
  )

For spetzler_martin_grade, create a binary variant of 1-3 vs. 4-5.

# For multivariable analysis
df_multi <-
  df_multi %>%
  mutate(
    # Divides into low grade vs. high grade
    is_spetzler_martin_grade_less_than_4 = spetzler_martin_grade < 4
  ) %>%
  relocate(is_spetzler_martin_grade_less_than_4, .after = spetzler_martin_grade)

# For univariable analysis
df_uni <-
  df_uni %>%
  mutate(
    is_spetzler_martin_grade_less_than_4 = spetzler_martin_grade < 4
  ) %>%
  relocate(is_spetzler_martin_grade_less_than_4, .after = spetzler_martin_grade)

Missingness

# Get cols
cols <- unname(c(
  EXPOSURES_CONTINUOUS,
  EXPOSURES_BINARY,
  EXPOSURES_CATEGORICAL,
  OUTCOMES
))

# Count missingness
df_multi %>%
  select(all_of(cols)) %>%
  summarise_all(~ sum(is.na(.))) %>%
  pivot_longer(everything(), values_to = "num_missing") %>%
  arrange(desc(num_missing)) %>%
  filter(num_missing > 0) %>%
  sable()
name num_missing
modified_rankin_score_pretreatment 2
modified_rankin_score_postop_within_1_week 2
is_surgery 2
procedure_combinations 2
modified_rankin_score_final_minus_pretx 2
has_complication_minor 1
has_complication_major 1

Which eligible patients are missing each variable?

df_multi %>%
  filter(is_eligible) %>%
  select(mrn, all_of(cols)) %>%
  mutate(across(-mrn, is.na)) %>%
  pivot_longer(
    cols = -mrn,
    names_to = "name",
    values_to = "is_missing"
  ) %>%
  filter(is_missing) %>%
  group_by(name) %>%
  summarize(mrn = str_c(mrn, collapse = ", ")) %>%
  sable()
name mrn
has_complication_major 62023254
has_complication_minor 62023254
is_surgery 49659493, 46166047
modified_rankin_score_final_minus_pretx 49659493, 46166047
modified_rankin_score_postop_within_1_week 49659493, 46166047
modified_rankin_score_pretreatment 49659493, 46166047
procedure_combinations 49659493, 46166047

Visualizations

Overall

Distribution of values of the exposures across levels of the outcome.

# Define the predictors of interest
cols <- c(
  EXPOSURES_CONTINUOUS,
  EXPOSURES_BINARY,
  CHOSEN_OUTCOME
)

# Reshape the data
df_long <-
  df_uni %>%
  select(all_of(cols)) %>%
  pivot_longer(-all_of(CHOSEN_OUTCOME), names_to = "predictor", values_to = "value")

# Plot the box plots
df_long %>%
  ggplot(aes(
    x = .data[[CHOSEN_OUTCOME]],
    y = value,
    fill = .data[[CHOSEN_OUTCOME]],
    color = .data[[CHOSEN_OUTCOME]]
  )) +
  geom_boxplot(alpha = 0.5) +
  facet_wrap(~predictor, scales = "free") +
  labs(x = "Outcome", y = "Value", fill = "Outcome") +
  theme_minimal() +
  theme(
    axis.text = element_text(size = 10),
    axis.title = element_text(size = 11),
    strip.text = element_text(size = 11),
    legend.position = "none"
  )

By hemorrhage

# Define the predictors of interest
cols <- c(
  EXPOSURES_CONTINUOUS,
  EXPOSURES_BINARY,
  CHOSEN_OUTCOME
)

# Reshape the data
df_long <-
  df_uni %>%
  select(all_of(cols)) %>%
  pivot_longer(
    cols = -c(all_of(CHOSEN_OUTCOME), `Hemorrhage at presentation`),
    names_to = "predictor",
    values_to = "value"
  )

# Plot the box plots
df_long %>%
  ggplot(aes(
    x = .data[[CHOSEN_OUTCOME]],
    y = value,
    fill = `Hemorrhage at presentation`,
    color = `Hemorrhage at presentation`
  )) +
  geom_boxplot(alpha = 0.5) +
  facet_wrap(~predictor, scales = "free") +
  labs(x = "Outcome", y = "Value", color = "Hemorrhage at presentation") +
  theme_minimal() +
  guides(fill = "none") +
  theme(
    axis.text = element_text(size = 10),
    axis.title = element_text(size = 11),
    strip.text = element_text(size = 11)
  )

By SM grade high vs. low

# Define the predictors of interest
cols <- c(
  EXPOSURES_CONTINUOUS,
  EXPOSURES_BINARY,
  CHOSEN_OUTCOME
)

# Reshape the data
df_long <-
  df_uni %>%
  select(all_of(cols)) %>%
  pivot_longer(
    cols = -c(all_of(CHOSEN_OUTCOME), `Spetzler-Martin grade < 4`),
    names_to = "predictor",
    values_to = "value"
  )

# Plot the box plots
df_long %>%
  ggplot(aes(
    x = .data[[CHOSEN_OUTCOME]],
    y = value,
    fill = `Spetzler-Martin grade < 4`,
    color = `Spetzler-Martin grade < 4`
  )) +
  geom_boxplot(alpha = 0.5) +
  facet_wrap(~predictor, scales = "free") +
  labs(x = "Outcome", y = "Value", color = "Spetzler-Martin grade < 4") +
  theme_minimal() +
  guides(fill = "none") +
  theme(
    axis.text = element_text(size = 10),
    axis.title = element_text(size = 11),
    strip.text = element_text(size = 11)
  )


Descriptive statistics

Setup

Descriptive statistics - continuous variables.

compute_descriptive_continuous <- function(
    df = df_uni,
    cols = c(EXPOSURES_CONTINUOUS),
    use_interaction = TRUE,
    interaction_col = "has_hemorrhage",
    interaction_col_name = "Hemorrhage at presentation",
    interaction_col_value = TRUE,
    prefix = "Haem") {
  # Apply desired formatting to numbers
  format_numbers <- function(x) {
    sprintf("%.1f", x)
  }

  # Compute new dataset
  df <-
    df %>%
    select(all_of(c(cols, CHOSEN_OUTCOME, interaction_col))) %>%
    drop_na(all_of(c(CHOSEN_OUTCOME, interaction_col)))

  # Apply interaction
  if (use_interaction) {
    df <-
      df %>%
      filter(!!sym(interaction_col) == interaction_col_value)
  }

  # Compute descriptive statistics
  desc_stats <-
    df %>%
    pivot_longer(where(is.numeric), names_to = "keys", values_to = "values") %>%
    group_by(!!sym(CHOSEN_OUTCOME), keys) %>%
    summarize(
      num_total = n(),
      num_missing = sum(is.na(values)),
      mean = mean(values, na.rm = TRUE),
      sd = sd(values, na.rm = TRUE),
      min = quantile(values, 0, na.rm = TRUE),
      max = quantile(values, 1, na.rm = TRUE),
      q_50 = quantile(values, 0.50, na.rm = TRUE),
      q_25 = quantile(values, 0.25, na.rm = TRUE),
      q_75 = quantile(values, 0.75, na.rm = TRUE)
    ) %>%
    ungroup() %>%
    mutate(
      stats = glue::glue("{format_numbers(mean)} ({format_numbers(sd)})"),
      !!sym(CHOSEN_OUTCOME) := factor(
        !!sym(CHOSEN_OUTCOME),
        levels = CHOSEN_OUTCOME_LEVELS,
        labels = CHOSEN_OUTCOME_LABELS
      )
    ) %>%
    pivot_wider(
      id_cols = keys,
      names_from = !!sym(CHOSEN_OUTCOME),
      values_from = stats
    ) %>%
    relocate(CHOSEN_OUTCOME_LABELS[2], .after = CHOSEN_OUTCOME_LABELS[1])

  # Compute p-values
  pvals <-
    df %>%
    pivot_longer(where(is.numeric), names_to = "keys", values_to = "values") %>%
    group_by(keys) %>%
    summarize(
      pvalue = wilcox.test(values ~ !!sym(CHOSEN_OUTCOME))$p.value
    )

  # Synthesize
  out <-
    desc_stats %>%
    left_join(pvals, by = "keys")

  # Add sample size to column names
  num_with_outcome <- sum(df %>% pull(CHOSEN_OUTCOME))
  num_no_outcome <- sum(!df %>% pull(CHOSEN_OUTCOME))

  new_col_names <- c(
    glue::glue("{prefix} - {CHOSEN_OUTCOME_LABELS[2]} (n={num_with_outcome})"),
    glue::glue("{prefix} - {CHOSEN_OUTCOME_LABELS[1]} (n={num_no_outcome})"),
    glue::glue("{prefix} - P-value")
  )

  out <-
    out %>%
    rename_with(
      .cols = !!sym(CHOSEN_OUTCOME_LABELS[1]):pvalue, ~new_col_names
    ) %>%
    mutate(
      keys = ifelse(keys == interaction_col, interaction_col_name, keys)
    )

  return(out)
}

Descriptive statistics - binary variables.

compute_descriptive_binary <- function(
    df = df_uni,
    cols = c(EXPOSURES_BINARY),
    use_interaction = TRUE,
    interaction_col = "has_hemorrhage",
    interaction_col_name = "Hemorrhage at presentation",
    interaction_col_value = TRUE,
    prefix = "Haem") {
  # Define arguments
  cols <- c(cols[!cols == interaction_col])

  # Apply desired formatting to numbers
  format_numbers <- function(x, decimals = 0) {
    decimals <- paste0("%.", decimals, "f")
    sprintf(decimals, x)
  }

  # Compute new dataset
  df <-
    df %>%
    select(all_of(c(cols, CHOSEN_OUTCOME, interaction_col))) %>%
    drop_na(all_of(c(CHOSEN_OUTCOME, interaction_col)))

  # Apply interaction
  if (use_interaction) {
    df <-
      df %>%
      filter(!!sym(interaction_col) == interaction_col_value)
  }

  # Compute descriptive statistics
  desc_stats <-
    df %>%
    pivot_longer(
      cols = -c(CHOSEN_OUTCOME),
      names_to = "keys",
      values_to = "values"
    ) %>%
    group_by(!!sym(CHOSEN_OUTCOME), keys) %>%
    summarize(
      num_total = n(),
      num_missing = sum(is.na(values)),
      num_with = sum(values, na.rm = TRUE),
      num_without = sum(!values, na.rm = TRUE),
      pct_with = mean(values, na.rm = TRUE)
    ) %>%
    ungroup() %>%
    mutate(
      stats = glue::glue("{num_with} ({format_numbers(pct_with*100)}%)"),
      !!sym(CHOSEN_OUTCOME) := factor(
        !!sym(CHOSEN_OUTCOME),
        levels = CHOSEN_OUTCOME_LEVELS,
        labels = CHOSEN_OUTCOME_LABELS
      )
    ) %>%
    pivot_wider(
      id_cols = keys,
      names_from = !!sym(CHOSEN_OUTCOME),
      values_from = stats
    ) %>%
    relocate(CHOSEN_OUTCOME_LABELS[1], .after = CHOSEN_OUTCOME_LABELS[2])

  # Compute p-values
  pvals <-
    df %>%
    pivot_longer(
      cols = -c(CHOSEN_OUTCOME),
      names_to = "keys",
      values_to = "values"
    ) %>%
    group_by(keys) %>%
    summarize(
      pvalue = kruskal.test(values ~ !!sym(CHOSEN_OUTCOME))$p.value
    )

  # Synthesize
  out <-
    desc_stats %>%
    left_join(pvals, by = "keys")

  # Prettify
  num_with_outcome <- sum(df %>% pull(CHOSEN_OUTCOME))
  num_no_outcome <- sum(!df %>% pull(CHOSEN_OUTCOME))

  new_col_names <- c(
    glue::glue("{prefix} - {CHOSEN_OUTCOME_LABELS[2]} (n={num_with_outcome})"),
    glue::glue("{prefix} - {CHOSEN_OUTCOME_LABELS[1]} (n={num_no_outcome})"),
    glue::glue("{prefix} - P-value")
  )

  out <-
    out %>%
    rename_with(
      .cols = !!sym(CHOSEN_OUTCOME_LABELS[2]):pvalue, ~new_col_names
    ) %>%
    mutate(
      keys = ifelse(keys == interaction_col, interaction_col_name, keys)
    )

  return(out)
}

Non-specific

# Rename dataframe
`Pediatric AVMs` <-
  df_uni %>%
  filter(is_eligible) %>%
  select(-is_eligible, -comments)

# Create summary
`Pediatric AVMs` %>%
  summarytools::dfSummary(display.labels = FALSE) %>%
  print(
    file =
      "../../outputs/descriptive-statistics/descriptive_statistics_eligible.html",
    footnote = NA
  )

# Remove unwanted dataframe
remove(`Pediatric AVMs`)

By hemorrhage - table1 (WIP)

https://cran.r-project.org/web/packages/table1/vignettes/table1-examples.html

library(table1)

table1::table1(
  ~ factor(is_male) + age_at_first_treatment_yrs + factor(is_eloquent_location) + max_size_cm | has_hemorrhage,
  data = df_uni
)
# Initialize values
df <- df_uni

# Remove missing values in stratification variables
df <-
  df %>%
  drop_na(has_hemorrhage, CHOSEN_OUTCOME)

# Define columns
cols <- c(
  EXPOSURES_CONTINUOUS,
  EXPOSURES_BINARY,
  OUTCOMES[OUTCOMES == CHOSEN_OUTCOME]
)

# Assign labels to the variables in the dataframe
for (var in cols) {
  label(df[[var]]) <- names(cols[cols == var])
}

# Assign meaningful labels to the has_hemorrhage variable
df$has_hemorrhage <- factor(
  df$has_hemorrhage,
  levels = c(FALSE, TRUE),
  labels = c("No Hemorrhage", "Has Hemorrhage")
)

df[, CHOSEN_OUTCOME] <- factor(
  df %>% pull(CHOSEN_OUTCOME),
  levels = CHOSEN_OUTCOME_LEVELS,
  labels = CHOSEN_OUTCOME_LABELS
)

# Define custom rendering functions if needed
render_continuous <- function(x) {
  with(stats.apply.rounding(stats.default(x), digits = 2), c(
    "Mean (SD)" = sprintf("%s (&plusmn; %s)", MEAN, SD),
    "Median (IQR)" = sprintf("%s (%s - %s)", MEDIAN, Q1, Q3)
  ))
}

compute_pvalues <- function(x, ...) {
  # Construct vectors of data y, and groups (strata) g
  y <- unlist(x)
  g <- factor(rep(1:length(x), times = sapply(x, length)))
  if (is.numeric(y)) {
    # For numeric variables, perform a standard 2-sample t-test
    p <- t.test(y ~ g)$p.value
  } else {
    # For categorical variables, perform a chi-squared test of independence
    p <- chisq.test(table(y, g))$p.value
  }
  # Format the p-value, using an HTML entity for the less-than sign.
  # The initial empty string places the output on the line below the variable label.
  c("", sub("<", "&lt;", format.pval(p, digits = 3, eps = 0.001)))
}

# Define the table formula
frla <- as.formula(paste("~ . | has_hemorrhage *", CHOSEN_OUTCOME))

# Create table
table1_descriptive_statistics <-
  table1(frla,
    data = df[, unname(cols)],
    render.continuous = render_continuous,
    overall = c(left = "Overall"),
    topclass = "Rtable1-zebra"
  )

# Print
table1_descriptive_statistics

By hemorrhage

if (!str_detect(params$TITLE, "Hemorrhage")) {
  
  # Define arguments
  interaction_col <- "has_hemorrhage"
  interaction_col_name <- "Hemorrhage at presentation"
  prefix_true <- "Haem"
  prefix_false <- "No Haem"
  
  # Get all
  out <- list()
  out[[1]] <- compute_descriptive_continuous(
    use_interaction = F,
    interaction_col = interaction_col,
    interaction_col_name = interaction_col_name,
    interaction_col_value = T,
    prefix = "All"
  )
  out[[2]] <- compute_descriptive_binary(
    use_interaction = F,
    interaction_col = interaction_col,
    interaction_col_name = interaction_col_name,
    interaction_col_value = T,
    prefix = "All"
  )
  out_all <- bind_rows(out)
  
  # Get with hemorrhage
  out <- list()
  out[[1]] <- compute_descriptive_continuous(
    use_interaction = T,
    interaction_col = interaction_col,
    interaction_col_name = interaction_col_name,
    interaction_col_value = T,
    prefix = prefix_true
  )
  out[[2]] <- compute_descriptive_binary(
    use_interaction = T,
    interaction_col = interaction_col,
    interaction_col_name = interaction_col_name,
    interaction_col_value = T,
    prefix = prefix_true
  )
  out_haem <- bind_rows(out)
  
  # Get without hemorrhage
  out <- list()
  out[[1]] <- compute_descriptive_continuous(
    use_interaction = T,
    interaction_col = interaction_col,
    interaction_col_name = interaction_col_name,
    interaction_col_value = F,
    prefix = prefix_false
  )
  out[[2]] <- compute_descriptive_binary(
    use_interaction = T,
    interaction_col = interaction_col,
    interaction_col_name = interaction_col_name,
    interaction_col_value = F,
    prefix = prefix_false
  )
  out_no_haem <- bind_rows(out)
  
  # Synthesize
  out_complete <-
    out_all %>%
    left_join(out_haem, by = "keys") %>%
    left_join(out_no_haem, by = "keys") %>%
    arrange(`All - P-value`)
  
  # Print
  out_complete %>%
    sable()
  
}

By diffuse nidus

# Define arguments
interaction_col <- "is_diffuse_nidus"
interaction_col_name <- "Is diffuse nidus"
prefix_true <- "Diffuse"
prefix_false <- "Not Diffuse"

# Get all
out <- list()
out[[1]] <- compute_descriptive_continuous(
  use_interaction = F,
  interaction_col = interaction_col,
  interaction_col_name = interaction_col_name,
  interaction_col_value = T,
  prefix = "All"
)
out[[2]] <- compute_descriptive_binary(
  use_interaction = F,
  interaction_col = interaction_col,
  interaction_col_name = interaction_col_name,
  interaction_col_value = T,
  prefix = "All"
)
out_all <- bind_rows(out)

# Get with diffuse nidus
out <- list()
out[[1]] <- compute_descriptive_continuous(
  use_interaction = T,
  interaction_col = interaction_col,
  interaction_col_name = interaction_col_name,
  interaction_col_value = T,
  prefix = prefix_true
)
out[[2]] <- compute_descriptive_binary(
  use_interaction = T,
  interaction_col = interaction_col,
  interaction_col_name = interaction_col_name,
  interaction_col_value = T,
  prefix = prefix_true
)
out_haem <- bind_rows(out)

# Get without diffuse nidus
out <- list()
out[[1]] <- compute_descriptive_continuous(
  use_interaction = T,
  interaction_col = interaction_col,
  interaction_col_name = interaction_col_name,
  interaction_col_value = F,
  prefix = prefix_false
)
out[[2]] <- compute_descriptive_binary(
  use_interaction = T,
  interaction_col = interaction_col,
  interaction_col_name = interaction_col_name,
  interaction_col_value = F,
  prefix = prefix_false
)
out_no_haem <- bind_rows(out)

# Synthesize
out_complete <-
  out_all %>%
  left_join(out_haem, by = "keys") %>%
  left_join(out_no_haem, by = "keys") %>%
  arrange(`All - P-value`)

# Print
out_complete %>%
  sable()
keys All - Poor outcome (n=15) All - Good outcome (n=122) All - P-value Diffuse - Poor outcome (n=4) Diffuse - Good outcome (n=12) Diffuse - P-value Not Diffuse - Poor outcome (n=11) Not Diffuse - Good outcome (n=110) Not Diffuse - P-value
mRS (final) 0.7 (0.7) 3.9 (1.0) 0.00000 0.9 (0.7) 5.0 (1.2) 0.00277 0.7 (0.7) 3.5 (0.7) 0.00000
mRS (1-week post-op) 1.7 (1.2) 4.0 (0.8) 0.00000 1.4 (1.2) 4.2 (0.5) 0.00569 1.7 (1.2) 3.9 (0.8) 0.00000
mRS (pre-treatment) 1.9 (1.5) 3.3 (1.5) 0.00162 1.6 (1.4) 4.0 (0.8) 0.01574 1.9 (1.5) 3.0 (1.6) 0.02946
Spetzler-Martin grade < 4 8 (53%) 98 (80%) 0.01881 1 (25%) 3 (25%) 1.00000 7 (64%) 95 (86%) 0.04916
Involves eloquent location 12 (80%) 64 (52%) 0.04360 4 (100%) 10 (83%) 0.39802 8 (73%) 54 (49%) 0.13644
Spetzler-Martin grade 2.6 (1.1) 3.3 (1.3) 0.04461 4.2 (0.9) 4.2 (1.0) 1.00000 2.4 (1.0) 2.9 (1.3) 0.17544
Is diffuse nidus 4 (27%) 12 (10%) 0.05636 4 (100%) 12 (100%) 0 (0%) 0 (0%)
Age at 1st treatment (years) 11.2 (3.4) 9.1 (4.2) 0.06577 11.9 (4.0) 8.5 (2.1) 0.14145 11.1 (3.3) 9.4 (4.9) 0.25618
Lawton-Young grade 3.8 (1.4) 4.5 (1.6) 0.07814 6.5 (0.8) 6.2 (1.0) 0.74441 3.5 (1.1) 3.9 (1.3) 0.31227
Seizures at presentation 1 (7%) 28 (23%) 0.14661 1 (25%) 4 (33%) 0.76302 0 (0%) 24 (22%) 0.08487
Size score 1.5 (0.6) 1.7 (0.8) 0.20772 2.4 (0.7) 2.2 (1.0) 0.84113 1.4 (0.6) 1.5 (0.7) 0.38925
mRS (presentation) 3.0 (1.7) 3.7 (1.5) 0.24509 2.0 (1.8) 4.2 (0.5) 0.06180 3.2 (1.7) 3.5 (1.8) 0.56954
Sex (Male) 6 (40%) 67 (55%) 0.27623 2 (50%) 6 (50%) 1.00000 4 (36%) 61 (55%) 0.22791
Paresis at presentation 6 (40%) 34 (28%) 0.33126 3 (75%) 4 (33%) 0.15896 3 (27%) 30 (27%) 1.00000
Nidus size (cm) 2.9 (1.9) 3.9 (3.2) 0.33159 5.5 (1.9) 6.8 (4.7) 0.58304 2.6 (1.6) 2.8 (1.7) 0.64457
Had surgery 9 (60%) 81 (68%) 0.56273 1 (25%) 8 (67%) 0.15896 8 (73%) 73 (68%) 0.72895
Deficit at presentation 6 (40%) 45 (37%) 0.81450 3 (75%) 4 (33%) 0.15896 3 (27%) 41 (37%) 0.51269
Presence of aneurysms 4 (27%) 30 (25%) 0.86104 1 (25%) 2 (17%) 0.72030 3 (27%) 28 (25%) 0.89564
Has deep venous drainage 9 (60%) 73 (60%) 0.99028 3 (75%) 12 (100%) 0.08326 6 (55%) 61 (55%) 0.95407
Hemorrhage at presentation 15 (100%) 122 (100%) 4 (100%) 12 (100%) 11 (100%) 110 (100%)

By SM grade high vs. low

# Define arguments
interaction_col <- "is_spetzler_martin_grade_less_than_4"
interaction_col_name <- "Spetzler-Martin grade < 4"
prefix_true <- "Low grade"
prefix_false <- "High grade"

# Get all
out <- list()
out[[1]] <- compute_descriptive_continuous(
  use_interaction = F,
  interaction_col = interaction_col,
  interaction_col_name = interaction_col_name,
  interaction_col_value = T,
  prefix = "All"
)
out[[2]] <- compute_descriptive_binary(
  use_interaction = F,
  interaction_col = interaction_col,
  interaction_col_name = interaction_col_name,
  interaction_col_value = T,
  prefix = "All"
)
out_all <- bind_rows(out)

# Get low grade (SM < 4)
out <- list()
out[[1]] <- compute_descriptive_continuous(
  use_interaction = T,
  interaction_col = interaction_col,
  interaction_col_name = interaction_col_name,
  interaction_col_value = T,
  prefix = prefix_true
)
out[[2]] <- compute_descriptive_binary(
  use_interaction = T,
  interaction_col = interaction_col,
  interaction_col_name = interaction_col_name,
  interaction_col_value = T,
  prefix = prefix_true
)
out_haem <- bind_rows(out)

# Get high grade (SM >= 4)
out <- list()
out[[1]] <- compute_descriptive_continuous(
  use_interaction = T,
  interaction_col = interaction_col,
  interaction_col_name = interaction_col_name,
  interaction_col_value = F,
  prefix = prefix_false
)
out[[2]] <- compute_descriptive_binary(
  use_interaction = T,
  interaction_col = interaction_col,
  interaction_col_name = interaction_col_name,
  interaction_col_value = F,
  prefix = prefix_false
)
out_no_haem <- bind_rows(out)

# Synthesize
out_complete <-
  out_all %>%
  left_join(out_haem, by = "keys") %>%
  left_join(out_no_haem, by = "keys") %>%
  arrange(`All - P-value`)

# Print
out_complete %>%
  sable()
keys All - Poor outcome (n=15) All - Good outcome (n=122) All - P-value Low grade - Poor outcome (n=8) Low grade - Good outcome (n=98) Low grade - P-value High grade - Poor outcome (n=7) High grade - Good outcome (n=24) High grade - P-value
mRS (final) 0.7 (0.7) 3.9 (1.0) 0.00000 0.7 (0.7) 3.8 (1.2) 0.00000 0.8 (0.7) 4.1 (0.9) 0.00003
mRS (1-week post-op) 1.7 (1.2) 4.0 (0.8) 0.00000 1.7 (1.3) 3.9 (0.8) 0.00010 1.6 (0.9) 4.1 (0.7) 0.00010
mRS (pre-treatment) 1.9 (1.5) 3.3 (1.5) 0.00162 1.9 (1.6) 3.2 (1.4) 0.01723 1.7 (1.1) 3.3 (1.7) 0.02826
Spetzler-Martin grade < 4 8 (53%) 98 (80%) 0.01881 8 (100%) 98 (100%) 0 (0%) 0 (0%)
Involves eloquent location 12 (80%) 64 (52%) 0.04360 5 (62%) 42 (43%) 0.28451 7 (100%) 22 (92%) 0.43727
Spetzler-Martin grade 2.6 (1.1) 3.3 (1.3) 0.04461 2.2 (0.8) 2.2 (0.9) 0.76348 4.3 (0.5) 4.4 (0.5) 0.52118
Diffuse nidus 4 (27%) 12 (10%) 0.05636 1 (12%) 3 (3%) 0.18001 3 (43%) 9 (38%) 0.80114
Age at 1st treatment (years) 11.2 (3.4) 9.1 (4.2) 0.06577 11.0 (3.3) 9.2 (4.8) 0.38076 12.1 (3.6) 9.0 (3.8) 0.08290
Lawton-Young grade 3.8 (1.4) 4.5 (1.6) 0.07814 3.3 (1.0) 3.4 (1.1) 0.90481 5.8 (1.0) 5.9 (0.9) 0.94021
Seizures at presentation 1 (7%) 28 (23%) 0.14661 0 (0%) 20 (20%) 0.15799 1 (14%) 8 (33%) 0.33655
Size score 1.5 (0.6) 1.7 (0.8) 0.20772 1.2 (0.4) 1.1 (0.4) 0.44946 2.4 (0.5) 2.4 (0.5) 0.97796
mRS (presentation) 3.0 (1.7) 3.7 (1.5) 0.24509 3.2 (1.7) 4.0 (1.4) 0.24837 2.5 (1.8) 3.4 (1.7) 0.31867
Sex (Male) 6 (40%) 67 (55%) 0.27623 3 (38%) 57 (58%) 0.25911 3 (43%) 10 (42%) 0.95594
Paresis at presentation 6 (40%) 34 (28%) 0.33126 3 (38%) 28 (29%) 0.59523 3 (43%) 6 (25%) 0.36762
Nidus size (cm) 2.9 (1.9) 3.9 (3.2) 0.33159 2.3 (1.2) 1.8 (0.7) 0.29328 5.3 (2.2) 6.2 (3.3) 0.42004
Had surgery 9 (60%) 81 (68%) 0.56273 5 (62%) 69 (72%) 0.57574 4 (57%) 12 (50%) 0.74342
Deficit at presentation 6 (40%) 45 (37%) 0.81450 3 (38%) 38 (39%) 0.94348 3 (43%) 7 (29%) 0.50242
Presence of aneurysms 4 (27%) 30 (25%) 0.86104 2 (25%) 24 (24%) 0.97439 2 (29%) 6 (25%) 0.85173
Has deep venous drainage 9 (60%) 73 (60%) 0.99028 4 (50%) 51 (52%) 0.91197 5 (71%) 22 (92%) 0.16681
Hemorrhage at presentation 15 (100%) 122 (100%) 8 (100%) 98 (100%) 7 (100%) 24 (100%)

Associations

Correlation matrix.

# Cols
cols <- c(
  EXPOSURES_CONTINUOUS,
  EXPOSURES_BINARY,
  EXPOSURES_CATEGORICAL,
  OUTCOMES
)

# Remove hemorrhage when the analysis is already stratified by hemorrhage status
if (str_detect(params$TITLE, "Hemorrhage")) {
  cols <- cols[!cols == "has_hemorrhage"]
}
  
# Create new dataframe
df_ <-
  df_uni %>%
  select(all_of(cols))

# Convert the outcome variable to numeric for correlation calculation
df_ <-
  df_ %>%
  select(where(is.numeric), where(is.logical)) %>%
  mutate(across(everything(), as.numeric))

# Compute the correlation matrix
cor_matrix <- cor(df_, use = "complete.obs", method = "spearman")

# Plot the correlation matrix
ggcorrplot::ggcorrplot(
  cor_matrix,
  method = "circle",
  lab = TRUE,
  lab_size = 2,
  colors = c("red", "white", "green4"),
  title = "Correlation Matrix",
  hc.order = TRUE
) +
  theme(
    axis.text.x = element_text(size = 8), # Adjust x-axis text size
    axis.text.y = element_text(size = 8) # Adjust y-axis text size
  )


Univariable statistics

Setup

Define function.

fit_model <- function(
    df = df_uni, cols, is_sandwich = TRUE) {
  # Initialize
  out <- list()

  for (i in seq_along(cols)) {
    # Create formula
    model <- as.formula(paste(CHOSEN_OUTCOME, "~", cols[i]))

    fit <- suppressWarnings(glm(
      model,
      data = df,
      family = binomial()
    ))

    if (is_sandwich) {
      # Calculate robust standard errors (sandwich)
      robust_se <- sandwich::vcovHC(fit, type = "HC0")
      # Compute robust confidence intervals
      fit_results <- lmtest::coeftest(fit, vcov. = robust_se)
    } else {
      fit_results <- fit
    }

    # Summarize model coefficients
    fit_summary <-
      fit_results %>%
      # broom does not support exponentiation after coeftest so do it manually
      broom::tidy(conf.int = T) %>%
      mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
      arrange(term == "(Intercept)", p.value) %>%
      rename(odds_ratio = estimate, z_value = statistic)

    # Stylize and print
    stylized <-
      fit_summary %>%
      rename(
        "Predictors" = term,
        "Odds Ratios (OR)" = odds_ratio,
        "SE" = std.error,
        "Z-scores" = z_value,
        "P-values" = p.value,
        "CI (low)" = conf.low,
        "CI (high)" = conf.high,
      ) %>%
      mutate(
        `Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
        `SE` = round(`SE`, 2),
        `CI (low)` = round(`CI (low)`, 2),
        `CI (high)` = round(`CI (high)`, 2),
        `P-values` = round(`P-values`, 3),
      )

    out[[i]] <- stylized
  }
  return(out)
}
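For context on the robust-variance step inside fit_model, a small comparison of model-based and HC0 sandwich standard errors on simulated data (not the study data); in fit_model the same vcovHC matrix is passed to lmtest::coeftest:

```r
# Compare model-based and HC0 (sandwich) standard errors for a logistic fit.
set.seed(1891)
n <- 300
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 0.8 * x))
fit <- glm(y ~ x, family = binomial())

se_model <- sqrt(diag(vcov(fit)))                           # model-based
se_robust <- sqrt(diag(sandwich::vcovHC(fit, type = "HC0")))  # sandwich
round(cbind(model = se_model, robust = se_robust), 3)
```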

Unadjusted - Original

Fit logistic.

# Define cols
cols <- unname(c(
  EXPOSURES_CONTINUOUS,
  EXPOSURES_BINARY,
  EXPOSURES_CATEGORICAL
))

# Fit model
out <- fit_model(df = df_uni, cols = cols, is_sandwich = T)

# Create table
univariable_unadjusted <-
  out %>%
  bind_rows() %>%
  filter(Predictors != "(Intercept)") %>%
  # Drop unstable estimates with extreme upper confidence limits
  filter(`CI (high)` < 50) %>%
  arrange(`P-values`)

# Print
univariable_unadjusted %>%
  sable()
Predictors Odds Ratios (OR) SE Z-scores P-values CI (low) CI (high)
modified_rankin_score_postop_within_1_week 4.87 0.28 5.59565 0.000 2.80 8.47
locationCorpus Callosum 0.00 0.75 -21.18946 0.000 0.00 0.00
locationOther 0.00 1.18 -13.45130 0.000 0.00 0.00
procedure_combinationsES 0.00 1.18 -14.76269 0.000 0.00 0.00
modified_rankin_score_pretreatment 1.69 0.16 3.35561 0.001 1.24 2.29
is_spetzler_martin_grade_less_than_4TRUE 0.28 0.57 -2.25200 0.024 0.09 0.85
is_eloquent_locationTRUE 3.62 0.67 1.92082 0.055 0.97 13.49
spetzler_martin_grade 1.65 0.27 1.88393 0.060 0.98 2.78
age_at_first_treatment_yrs 0.84 0.09 -1.84154 0.066 0.70 1.01
is_diffuse_nidusTRUE 3.33 0.66 1.82897 0.067 0.92 12.11
lawton_young_grade 1.38 0.19 1.71570 0.086 0.95 2.00
max_size_cm 1.21 0.11 1.67336 0.094 0.97 1.51
modified_rankin_score_presentation 1.28 0.17 1.49517 0.135 0.93 1.77
size_score 1.70 0.39 1.35790 0.174 0.79 3.63
has_seizuresTRUE 0.24 1.06 -1.35064 0.177 0.03 1.90
locationHemispheric 0.37 0.78 -1.27202 0.203 0.08 1.71
procedure_combinationsR 0.17 1.55 -1.15926 0.246 0.01 3.45
is_maleTRUE 0.55 0.56 -1.08116 0.280 0.18 1.63
has_paresisTRUE 1.73 0.56 0.96652 0.334 0.57 5.22
procedure_combinationsS 0.31 1.27 -0.92921 0.353 0.03 3.70
procedure_combinationsRS 0.38 1.38 -0.71235 0.476 0.03 5.57
is_surgeryTRUE 0.72 0.56 -0.57912 0.563 0.24 2.17
locationDeep 1.38 0.76 0.42707 0.669 0.31 6.12
procedure_combinationsER 0.67 1.28 -0.31672 0.751 0.05 8.20
venous_drainageDeep 0.81 0.75 -0.28367 0.777 0.18 3.53
has_deficitTRUE 1.14 0.56 0.23537 0.814 0.38 3.42
venous_drainageSuperficial 0.86 0.75 -0.20453 0.838 0.20 3.75
has_associated_aneurysmTRUE 1.12 0.62 0.17563 0.861 0.33 3.76
procedure_combinationsERS 0.82 1.33 -0.15137 0.880 0.06 11.00
has_deep_venous_drainageTRUE 1.01 0.56 0.01222 0.990 0.34 3.01

Adjusted - Original

Fit logistic.

cols <- unname(c(
  EXPOSURES_CONTINUOUS,
  EXPOSURES_BINARY,
  EXPOSURES_CATEGORICAL
))

# Initialize
out <- list()
df <- df_uni

for (i in seq_along(cols)) {
  # Create formula
  model <- as.formula(paste(
    CHOSEN_OUTCOME,
    "~",
    "age_at_first_treatment_yrs + is_male +",
    cols[i]
  ))

  fit <- suppressWarnings(glm(
    model,
    data = df,
    family = binomial()
  ))

  # Calculate robust standard errors (sandwich)
  robust_se <- sandwich::vcovHC(fit, type = "HC0")
  # Compute robust confidence intervals
  robust_ci <- lmtest::coeftest(fit, vcov. = robust_se)

  # Summarize model coefficients
  fit_summary <-
    robust_ci %>%
    # broom does not support exponentiation after coeftest so do it manually
    broom::tidy(conf.int = T) %>%
    mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
    arrange(term == "(Intercept)", p.value) %>%
    rename(odds_ratio = estimate, z_value = statistic)

  # Stylize and print
  stylized <-
    fit_summary %>%
    rename(
      "Predictors" = term,
      "Odds Ratios (OR)" = odds_ratio,
      "SE" = std.error,
      "Z-scores" = z_value,
      "P-values" = p.value,
      "CI (low)" = conf.low,
      "CI (high)" = conf.high,
    ) %>%
    mutate(
      `Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
      `SE` = round(`SE`, 2),
      `CI (low)` = round(`CI (low)`, 2),
      `CI (high)` = round(`CI (high)`, 2),
      `P-values` = round(`P-values`, 3),
    )

  out[[i]] <- stylized
}

Print results.

# Create table
univariable_adjusted <-
  out %>%
  bind_rows() %>%
  filter(Predictors != "(Intercept)") %>%
  filter(!str_starts(Predictors, "age_at_first_treatment_yrs|is_male")) %>%
  filter(`CI (high)` < 50) %>%
  arrange(`P-values`)

# Print
univariable_adjusted %>%
  sable()
Predictors Odds Ratios (OR) SE Z-scores P-values CI (low) CI (high)
modified_rankin_score_postop_within_1_week 6.26 0.33 5.51207 0.000 3.26 12.01
locationCorpus Callosum 0.00 0.68 -23.11674 0.000 0.00 0.00
locationOther 0.00 1.23 -12.75136 0.000 0.00 0.00
procedure_combinationsES 0.00 1.27 -13.91525 0.000 0.00 0.00
modified_rankin_score_pretreatment 1.66 0.16 3.12554 0.002 1.21 2.29
is_eloquent_locationTRUE 3.90 0.61 2.23437 0.025 1.18 12.85
spetzler_martin_grade 1.81 0.28 2.12267 0.034 1.05 3.14
is_spetzler_martin_grade_less_than_4TRUE 0.27 0.65 -2.04113 0.041 0.08 0.95
lawton_young_grade 1.44 0.19 1.88025 0.060 0.98 2.09
is_diffuse_nidusTRUE 3.47 0.71 1.75503 0.079 0.86 13.95
size_score 1.93 0.43 1.50957 0.131 0.82 4.52
max_size_cm 1.22 0.14 1.45185 0.147 0.93 1.59
procedure_combinationsR 0.10 1.66 -1.39609 0.163 0.00 2.54
has_seizuresTRUE 0.23 1.07 -1.37361 0.170 0.03 1.87
locationHemispheric 0.41 0.69 -1.27431 0.203 0.11 1.61
procedure_combinationsS 0.20 1.32 -1.22962 0.219 0.01 2.63
procedure_combinationsRS 0.21 1.36 -1.15824 0.247 0.01 2.97
modified_rankin_score_presentation 1.21 0.17 1.13510 0.256 0.87 1.69
has_paresisTRUE 1.95 0.59 1.12525 0.260 0.61 6.25
locationDeep 1.89 0.69 0.93076 0.352 0.49 7.26
has_deficitTRUE 1.58 0.62 0.73729 0.461 0.47 5.39
is_surgeryTRUE 0.67 0.57 -0.71431 0.475 0.22 2.02
has_associated_aneurysmTRUE 1.34 0.63 0.46332 0.643 0.39 4.64
procedure_combinationsER 0.54 1.38 -0.44206 0.658 0.04 8.18
venous_drainageSuperficial 0.73 0.81 -0.39236 0.695 0.15 3.58
venous_drainageDeep 0.73 0.80 -0.39047 0.696 0.15 3.51
procedure_combinationsERS 0.60 1.54 -0.32726 0.743 0.03 12.30
has_deep_venous_drainageTRUE 1.10 0.59 0.16731 0.867 0.35 3.47

Unadjusted - Recoded

Fit logistic.

# Define columns
cols <- unname(c(
  EXPOSURES_CONTINUOUS,
  EXPOSURES_BINARY,
  EXPOSURES_CATEGORICAL
))

# Initialize
out <- list()
df <- df_multi

for (i in seq_along(cols)) {
  # Create formula
  model <- as.formula(paste(CHOSEN_OUTCOME, "~", cols[i]))

  fit <- suppressWarnings(glm(
    model,
    data = df,
    family = binomial()
  ))

  # Calculate robust standard errors (sandwich)
  robust_se <- sandwich::vcovHC(fit, type = "HC0")
  # Compute robust confidence intervals
  robust_ci <- lmtest::coeftest(fit, vcov. = robust_se)

  # Summarize model coefficients
  fit_summary <-
    robust_ci %>%
    # broom does not support exponentiation after coeftest so do it manually
    broom::tidy(conf.int = T) %>%
    mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
    arrange(term == "(Intercept)", p.value) %>%
    rename(odds_ratio = estimate, z_value = statistic)

  # Stylize and print
  stylized <-
    fit_summary %>%
    rename(
      "Predictors" = term,
      "Odds Ratios (OR)" = odds_ratio,
      "SE" = std.error,
      "Z-scores" = z_value,
      "P-values" = p.value,
      "CI (low)" = conf.low,
      "CI (high)" = conf.high,
    ) %>%
    mutate(
      `Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
      `SE` = round(`SE`, 2),
      `CI (low)` = round(`CI (low)`, 2),
      `CI (high)` = round(`CI (high)`, 2),
      `P-values` = round(`P-values`, 3),
    )

  out[[i]] <- stylized
}

Print results.

# Create table
univariable_unadjusted_recoded <-
  out %>%
  bind_rows() %>%
  filter(Predictors != "(Intercept)") %>%
  filter(`CI (high)` < 50) %>%
  arrange(`P-values`)

# Print
univariable_unadjusted_recoded %>%
  sable()
Predictors Odds Ratios (OR) SE Z-scores P-values CI (low) CI (high)
modified_rankin_score_postop_within_1_week 4.87 0.28 5.59565 0.000 2.80 8.47
modified_rankin_score_pretreatment 1.69 0.16 3.35561 0.001 1.24 2.29
is_spetzler_martin_grade_less_than_4TRUE 0.28 0.57 -2.25200 0.024 0.09 0.85
is_eloquent_locationTRUE 3.62 0.67 1.92082 0.055 0.97 13.49
spetzler_martin_grade 1.65 0.27 1.88393 0.060 0.98 2.78
age_at_first_treatment_yrs 0.84 0.09 -1.84154 0.066 0.70 1.01
is_diffuse_nidusTRUE 3.33 0.66 1.82897 0.067 0.92 12.11
lawton_young_grade 1.38 0.19 1.71570 0.086 0.95 2.00
max_size_cm 1.21 0.11 1.67336 0.094 0.97 1.51
modified_rankin_score_presentation 1.28 0.17 1.49517 0.135 0.93 1.77
size_score 1.70 0.39 1.35790 0.174 0.79 3.63
has_seizuresTRUE 0.24 1.06 -1.35064 0.177 0.03 1.90
procedure_combinationsR 0.17 1.55 -1.15926 0.246 0.01 3.45
is_maleTRUE 0.55 0.56 -1.08116 0.280 0.18 1.63
has_paresisTRUE 1.73 0.56 0.96652 0.334 0.57 5.22
procedure_combinationsS 0.31 1.27 -0.92921 0.353 0.03 3.70
locationDeep 1.99 0.75 0.92069 0.357 0.46 8.58
locationHemispheric 0.53 0.77 -0.82015 0.412 0.12 2.40
procedure_combinationsMultimodal 0.45 1.21 -0.66060 0.509 0.04 4.81
is_surgeryTRUE 0.76 0.62 -0.45604 0.648 0.23 2.52
has_deficitTRUE 1.14 0.56 0.23537 0.814 0.38 3.42
has_associated_aneurysmTRUE 1.12 0.62 0.17563 0.861 0.33 3.76
has_deep_venous_drainageTRUE 1.01 0.56 0.01222 0.990 0.34 3.01
venous_drainageSuperficial 0.99 0.56 -0.01222 0.990 0.33 2.97

Adjusted - Recoded

Fit logistic.

# Define columns
cols <- unname(c(
  EXPOSURES_CONTINUOUS,
  EXPOSURES_BINARY,
  EXPOSURES_CATEGORICAL
))

# Initialize
out <- list()
df <- df_multi

for (i in seq_along(cols)) {
  # Create formula
  model <- as.formula(paste(
    CHOSEN_OUTCOME,
    "~",
    "age_at_first_treatment_yrs + is_male +",
    cols[i]
  ))

  fit <- suppressWarnings(glm(
    model,
    data = df,
    family = binomial()
  ))

  # Calculate robust standard errors (sandwich)
  robust_se <- sandwich::vcovHC(fit, type = "HC0")
  # Compute robust confidence intervals
  robust_ci <- lmtest::coeftest(fit, vcov. = robust_se)

  # Summarize model coefficients
  fit_summary <-
    robust_ci %>%
    # broom does not support exponentiation after coeftest so do it manually
    broom::tidy(conf.int = T) %>%
    mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
    arrange(term == "(Intercept)", p.value) %>%
    rename(odds_ratio = estimate, z_value = statistic)

  # Stylize and print
  stylized <-
    fit_summary %>%
    rename(
      "Predictors" = term,
      "Odds Ratios (OR)" = odds_ratio,
      "SE" = std.error,
      "Z-scores" = z_value,
      "P-values" = p.value,
      "CI (low)" = conf.low,
      "CI (high)" = conf.high,
    ) %>%
    mutate(
      `Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
      `SE` = round(`SE`, 2),
      `CI (low)` = round(`CI (low)`, 2),
      `CI (high)` = round(`CI (high)`, 2),
      `P-values` = round(`P-values`, 3),
    )

  out[[i]] <- stylized
}

Print results.

# Create table
univariable_adjusted_recoded <-
  out %>%
  bind_rows() %>%
  filter(Predictors != "(Intercept)") %>%
  filter(!str_starts(Predictors, "age_at_first_treatment_yrs|is_male")) %>%
  filter(`CI (high)` < 50) %>%
  arrange(`P-values`)

# Print
univariable_adjusted_recoded %>%
  sable()
Predictors Odds Ratios (OR) SE Z-scores P-values CI (low) CI (high)
modified_rankin_score_postop_within_1_week 6.26 0.33 5.51207 0.000 3.26 12.01
modified_rankin_score_pretreatment 1.66 0.16 3.12554 0.002 1.21 2.29
is_eloquent_locationTRUE 3.90 0.61 2.23437 0.025 1.18 12.85
spetzler_martin_grade 1.81 0.28 2.12267 0.034 1.05 3.14
is_spetzler_martin_grade_less_than_4TRUE 0.27 0.65 -2.04113 0.041 0.08 0.95
lawton_young_grade 1.44 0.19 1.88025 0.060 0.98 2.09
is_diffuse_nidusTRUE 3.47 0.71 1.75503 0.079 0.86 13.95
size_score 1.93 0.43 1.50957 0.131 0.82 4.52
locationDeep 2.74 0.68 1.48958 0.136 0.73 10.31
max_size_cm 1.22 0.14 1.45185 0.147 0.93 1.59
procedure_combinationsR 0.10 1.62 -1.43661 0.151 0.00 2.34
has_seizuresTRUE 0.23 1.07 -1.37361 0.170 0.03 1.87
procedure_combinationsS 0.19 1.29 -1.26767 0.205 0.02 2.44
modified_rankin_score_presentation 1.21 0.17 1.13510 0.256 0.87 1.69
has_paresisTRUE 1.95 0.59 1.12525 0.260 0.61 6.25
procedure_combinationsMultimodal 0.31 1.27 -0.92627 0.354 0.03 3.72
locationHemispheric 0.59 0.69 -0.75573 0.450 0.15 2.30
has_deficitTRUE 1.58 0.62 0.73729 0.461 0.47 5.39
is_surgeryTRUE 0.72 0.62 -0.53897 0.590 0.21 2.42
has_associated_aneurysmTRUE 1.34 0.63 0.46332 0.643 0.39 4.64
has_deep_venous_drainageTRUE 1.10 0.59 0.16731 0.867 0.35 3.47
venous_drainageSuperficial 0.91 0.59 -0.16731 0.867 0.29 2.86
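The fit → robust-SE → stylize block is repeated verbatim across the Original/Recoded, unadjusted/adjusted subsections above. A minimal sketch of a shared helper that could replace the four loops (the name `fit_univariable` is hypothetical; it assumes the sandwich, lmtest, broom, and dplyr packages loaded in Setup):

```r
# Hypothetical helper consolidating the repeated univariable loops.
# Assumes sandwich, lmtest, broom, and dplyr are available (see Setup).
fit_univariable <- function(df, outcome, cols, adjustors = character()) {
  lapply(cols, function(col) {
    # Optional adjustment covariates go on the right-hand side first
    rhs <- paste(c(adjustors, col), collapse = " + ")
    fit <- glm(
      as.formula(paste(outcome, "~", rhs)),
      data = df,
      family = binomial()
    )
    # Robust (HC0) Wald tests, exponentiated to odds ratios
    robust <- lmtest::coeftest(fit, vcov. = sandwich::vcovHC(fit, type = "HC0"))
    tidied <- broom::tidy(robust, conf.int = TRUE)
    dplyr::mutate(tidied, dplyr::across(c(estimate, conf.low, conf.high), exp))
  })
}
```

The adjusted variants would then reduce to calls like `fit_univariable(df_multi, CHOSEN_OUTCOME, cols, adjustors = c("age_at_first_treatment_yrs", "is_male"))`.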

Interaction analysis

Adjusted - Recoded

Fit logistic.

# Define columns
cols <- unname(c(
  EXPOSURES_CONTINUOUS,
  EXPOSURES_BINARY,
  EXPOSURES_CATEGORICAL
))

# Remove unwanted columns
unwanted <- c("modified_rankin_score_final")
cols <- cols[!cols %in% unwanted]

adjustors <- c(
  "age_at_first_treatment_yrs",
  "is_male"
)

# Initialize
out <- list()
df <- df_multi
k <- 1

for (i in seq_along(cols)) {
  # Do not fit model for variables that we are adjusting for
  if (cols[i] %in% adjustors) {
    next
  }

  # Create formula with a hemorrhage interaction term
  model <- as.formula(paste(
    CHOSEN_OUTCOME,
    "~",
    "age_at_first_treatment_yrs + is_male +",
    cols[i],
    "* has_hemorrhage"
  ))

  fit <- suppressWarnings(glm(
    model,
    data = df,
    family = binomial()
  ))

  # Calculate robust standard errors (sandwich)
  robust_se <- sandwich::vcovHC(fit, type = "HC0")
  # Compute robust confidence intervals
  robust_ci <- lmtest::coeftest(fit, vcov. = robust_se)

  # Summarize model coefficients
  fit_summary <-
    robust_ci %>%
    # broom does not support exponentiation after coeftest so do it manually
    broom::tidy(conf.int = T) %>%
    mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
    arrange(term == "(Intercept)", p.value) %>%
    rename(odds_ratio = estimate, z_value = statistic)

  # Stylize and print
  stylized <-
    fit_summary %>%
    rename(
      "Predictors" = term,
      "Odds Ratios (OR)" = odds_ratio,
      "SE" = std.error,
      "Z-scores" = z_value,
      "P-values" = p.value,
      "CI (low)" = conf.low,
      "CI (high)" = conf.high,
    ) %>%
    mutate(
      `Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
      `SE` = round(`SE`, 2),
      `CI (low)` = round(`CI (low)`, 2),
      `CI (high)` = round(`CI (high)`, 2),
      `P-values` = round(`P-values`, 3),
    )

  out[[k]] <- stylized
  k <- k + 1
}
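As a sanity check on the pasted interaction formula used in the loop above, `terms()` confirms that both main effects and the product term land in the model (toy call using one of the grading variables):

```r
# Rebuild the formula the loop pastes together for one exposure
f <- as.formula(paste(
  "is_poor_outcome",
  "~",
  "age_at_first_treatment_yrs + is_male +",
  "spetzler_martin_grade",
  "* has_hemorrhage"
))

# terms() expands a * b into a, b, and a:b
attr(terms(f), "term.labels")
```

The labels include `spetzler_martin_grade:has_hemorrhage`, so the interaction is in the model even though the pieces are pasted as strings.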

Print all results.

# Define values
unwanted <- c(
  "is_maleTRUE",
  "age_at_first_treatment_yrs",
  "(Intercept)",
  "has_hemorrhageTRUE"
)

# Initialize objects
isolated_values <- list()

# Get main and interaction effects
for (i in seq_along(out)) {
  # Get data
  result <- out[[i]]

  # Get all main effects of interest
  main_effects <-
    result %>%
    filter(!Predictors %in% unwanted) %>%
    filter(!str_detect(Predictors, ":")) %>%
    select(-"SE", -"Z-scores")

  # Get the interaction effects
  interaction_effects <-
    result %>%
    filter(!Predictors %in% unwanted) %>%
    filter(str_detect(Predictors, ":")) %>%
    mutate(Predictors = str_extract(Predictors, "^[^:]+")) %>%
    select(-"SE", -"Z-scores") %>%
    rename_with(~ paste("Interaction -", .), -Predictors)

  isolated_values[[i]] <-
    main_effects %>%
    left_join(interaction_effects, by = "Predictors")
}

Create and print table

# Create table
univariable_adjusted_recoded_interactions <-
  isolated_values %>%
  bind_rows() %>%
  arrange(`Interaction - P-values`)

# Print
univariable_adjusted_recoded_interactions %>%
  sable()
Predictors Odds Ratios (OR) P-values CI (low) CI (high) Interaction - Odds Ratios (OR) Interaction - P-values Interaction - CI (low) Interaction - CI (high)
modified_rankin_score_presentation 1.21 0.256 0.87 1.69
modified_rankin_score_pretreatment 1.66 0.002 1.21 2.29
modified_rankin_score_postop_within_1_week 6.26 0.000 3.26 12.01
max_size_cm 1.22 0.147 0.93 1.59
spetzler_martin_grade 1.81 0.034 1.05 3.14
lawton_young_grade 1.44 0.060 0.98 2.09
size_score 1.93 0.131 0.82 4.52
is_eloquent_locationTRUE 3.90 0.025 1.18 12.85
has_deep_venous_drainageTRUE 1.10 0.867 0.35 3.47
is_diffuse_nidusTRUE 3.47 0.079 0.86 13.95
has_seizuresTRUE 0.23 0.170 0.03 1.87
has_deficitTRUE 1.58 0.461 0.47 5.39
has_paresisTRUE 1.95 0.260 0.61 6.25
has_associated_aneurysmTRUE 1.34 0.643 0.39 4.64
is_spetzler_martin_grade_less_than_4TRUE 0.27 0.041 0.08 0.95
is_surgeryTRUE 0.72 0.590 0.21 2.42
locationDeep 2.74 0.136 0.73 10.31
locationHemispheric 0.59 0.450 0.15 2.30
venous_drainageSuperficial 0.91 0.867 0.29 2.86
procedure_combinationsR 0.10 0.151 0.00 2.34
procedure_combinationsS 0.19 0.205 0.02 2.44
procedure_combinationsMultimodal 0.31 0.354 0.03 3.72

Eloquence.

df %>%
  count(has_hemorrhage, is_eloquent_location, !!sym(CHOSEN_OUTCOME)) %>%
  drop_na() %>%
  arrange(has_hemorrhage, is_eloquent_location) %>%
  group_by(has_hemorrhage, is_eloquent_location) %>%
  mutate(
    num_total = sum(n),
    pct_total = scales::percent(n / num_total, 1)
  ) %>%
  sable()
has_hemorrhage is_eloquent_location is_poor_outcome n num_total pct_total
TRUE FALSE FALSE 58 61 95%
TRUE FALSE TRUE 3 61 5%
TRUE TRUE FALSE 64 76 84%
TRUE TRUE TRUE 12 76 16%

Deficit.

df %>%
  count(has_hemorrhage, has_deficit, !!sym(CHOSEN_OUTCOME)) %>%
  drop_na() %>%
  arrange(has_hemorrhage, has_deficit) %>%
  group_by(has_hemorrhage, has_deficit) %>%
  mutate(
    num_total = sum(n),
    pct_total = scales::percent(n / num_total, 1)
  ) %>%
  sable()
has_hemorrhage has_deficit is_poor_outcome n num_total pct_total
TRUE FALSE FALSE 77 86 90%
TRUE FALSE TRUE 9 86 10%
TRUE TRUE FALSE 45 51 88%
TRUE TRUE TRUE 6 51 12%

Seizures.

df %>%
  count(has_hemorrhage, has_seizures, !!sym(CHOSEN_OUTCOME)) %>%
  drop_na() %>%
  arrange(has_hemorrhage, has_seizures) %>%
  group_by(has_hemorrhage, has_seizures) %>%
  mutate(
    num_total = sum(n),
    pct_total = scales::percent(n / num_total, 1)
  ) %>%
  sable()
has_hemorrhage has_seizures is_poor_outcome n num_total pct_total
TRUE FALSE FALSE 94 108 87%
TRUE FALSE TRUE 14 108 13%
TRUE TRUE FALSE 28 29 97%
TRUE TRUE TRUE 1 29 3%

Selective inference

Read the following guides:

Setup

Define unwanted columns.

unwanted_all <- c(
  # Use values at or before first treatment
  "modified_rankin_score_postop_within_1_week",
  "modified_rankin_score_final",
  # Use the split between high vs. low grade instead of the granular
  "spetzler_martin_grade"
)

# Remove hemorrhage from the mix
if (str_detect(params$TITLE, "Hemorrhage")) {
  unwanted_all <- c(unwanted_all, "has_hemorrhage")
}

unwanted_without_gradings <- c(
  unwanted_all,
  # Spetzler-Martin grade (size score + eloquent location + venous drainage)
  "is_spetzler_martin_grade_less_than_4",
  "size_score", # Already covered by the more detailed max_size_cm
  "venous_drainage", # Already covered by has_deep_venous_drainage
  "location", # Already covered to some extent by eloquence
  # Lawton-Young grade (SM + age + diffuse nidus + hemorrhage)
  # (SM is heavily overlapping with LY, so include its components)
  "lawton_young_grade",
  # Use mRS pre-treatment (more predictive)
  "modified_rankin_score_presentation" # "modified_rankin_score_pretreatment"
)

# Define cols of interest - use grading scores whenever these exist
unwanted_with_gradings <- c(
  # Spetzler-Martin grade (size score + eloquent location + venous drainage)
  "is_spetzler_martin_grade_less_than_4", # The grade has lower p-value
  "size_score",
  "max_size_cm",
  "is_eloquent_location",
  # "location",  # Include this as not highly correlated with SM
  "has_deep_venous_drainage",
  "venous_drainage",
  # Lawton-Young grade (SM + age + diffuse nidus + hemorrhage)
  # (SM is heavily overlapping with LY, so include its components)
  "lawton_young_grade",
  # "is_diffuse_nidus",
  # "has_hemorrhage",
  # Use values at or before first treatment
  "modified_rankin_score_final",
  "modified_rankin_score_postop_within_1_week",
  # Use mRS pre-treatment (more predictive)
  "modified_rankin_score_presentation"
  # "modified_rankin_score_pretreatment"
)

Create dataset.

# Define cols
cols <- unname(c(
  EXPOSURES_CONTINUOUS,
  EXPOSURES_BINARY,
  EXPOSURES_CATEGORICAL,
  CHOSEN_OUTCOME
))

# Define column sets for each analysis
cols_all <- cols[!cols %in% unwanted_all]
cols_with_gradings <- cols[!cols %in% unwanted_with_gradings]
cols_without_gradings <- cols[!cols %in% unwanted_without_gradings]

# Create df of interest
df_all <-
  # Use the non-recoded dataset
  df_uni %>%
  dplyr::select(all_of(cols_all)) %>%
  drop_na()

df_with_gradings <-
  # Use the non-recoded dataset
  df_uni %>%
  dplyr::select(all_of(cols_with_gradings)) %>%
  drop_na()

df_no_scores <-
  # Use the non-recoded dataset
  df_uni %>%
  dplyr::select(all_of(cols_without_gradings)) %>%
  drop_na()

# Create formula
frla <- as.formula(paste(CHOSEN_OUTCOME, "~ . - 1"))

# Create matrices with all variables
X_all <- model.matrix(frla, df_all)
y_all <- df_all %>%
  pull(CHOSEN_OUTCOME) %>%
  as.numeric()

# Create matrices with scores
X_with_gradings <- model.matrix(frla, df_with_gradings)
y_with_gradings <- df_with_gradings %>%
  pull(CHOSEN_OUTCOME) %>%
  as.numeric()

# Create matrices with score components
X_without_gradings <- model.matrix(frla, df_no_scores)
y_without_gradings <- df_no_scores %>%
  pull(CHOSEN_OUTCOME) %>%
  as.numeric()

# X should be centered for LASSO (necessary for glmnet)
# (scale = FALSE as this is done by the standardization step in glmnet)
X_all_scaled <- scale(X_all, center = T, scale = F)
X_with_gradings_scaled <- scale(X_with_gradings, center = T, scale = F)
X_without_gradings_scaled <- scale(X_without_gradings, center = T, scale = F)

# See the names of X to match column numbers to column names later on
colnames(X_without_gradings_scaled)
 [1] "age_at_first_treatment_yrs"         "modified_rankin_score_pretreatment"
 [3] "max_size_cm"                        "is_maleFALSE"                      
 [5] "is_maleTRUE"                        "is_eloquent_locationTRUE"          
 [7] "has_deep_venous_drainageTRUE"       "is_diffuse_nidusTRUE"              
 [9] "has_seizuresTRUE"                   "has_deficitTRUE"                   
[11] "has_paresisTRUE"                    "has_associated_aneurysmTRUE"       
[13] "is_surgeryTRUE"                     "procedure_combinationsER"          
[15] "procedure_combinationsERS"          "procedure_combinationsES"          
[17] "procedure_combinationsR"            "procedure_combinationsRS"          
[19] "procedure_combinationsS"           
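Note the redundant `is_maleFALSE`/`is_maleTRUE` pair: with the intercept removed, `model.matrix` gives the first factor (or logical) a full set of dummies while later factors keep treatment contrasts. This leaves the design rank-deficient, which likely explains the astronomically large coefficients at steps 12 and 19 in the stepwise output below. A toy illustration (hypothetical data):

```r
d <- data.frame(
  y   = c(0, 1, 0, 1),
  x   = c(1.2, 3.4, 2.2, 0.5),
  sex = c(TRUE, FALSE, TRUE, FALSE),
  grp = factor(c("a", "b", "a", "b"))
)

# The first non-numeric term gets both levels; later factors are contrasted
colnames(model.matrix(y ~ . - 1, d))
# → "x" "sexFALSE" "sexTRUE" "grpb"
```

Dropping one of the paired dummies (or keeping the intercept) would restore full column rank.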

Stepwise

Use forward stepwise regression (selectiveInference::fs treats the binary outcome as numeric, so the Gaussian-based inference is approximate).

# Set seed
set.seed(33)

# Run forward step-wise
fsfit <- fs(
  x = X_without_gradings_scaled,
  y = y_without_gradings,
  maxsteps = 2000,
  intercept = TRUE,
  normalize = FALSE
)

# Estimate sigma
sigmahat <- estimateSigma(
  x = X_without_gradings_scaled,
  y = y_without_gradings,
  intercept = TRUE,
  standardize = FALSE
)$sigmahat
# Compute sequential p-values and confidence intervals - for all
# (sigma estimated from full model)
out.seq <- fsInf(fsfit, type = "active", sigma = sigmahat)
out.seq

Call:
fsInf(obj = fsfit, sigma = sigmahat, type = "active")

Standard deviation of noise (specified or estimated) sigma = 0.297

Sequential testing results with alpha = 0.100
 Step Var        Coef Z-score P-value   LowConfPt   UpConfPt LowTailArea UpTailArea
    1   2  5.6000e-02   3.438   0.016  1.6000e-02 8.3000e-02       0.050       0.05
    2   6  1.0100e-01   1.954   0.848 -2.0930e+00 1.2200e-01       0.050       0.05
    3   1 -1.5000e-02  -2.054   0.089 -2.2600e-01 1.2000e-02       0.050       0.05
    4   9 -1.0400e-01  -1.641   0.811 -2.7300e-01 2.2840e+00       0.050       0.05
    5   3  2.1000e-02   1.615   0.580 -1.1000e+00 7.4300e-01       0.050       0.05
    6   8  8.8000e-02   0.955   0.122 -9.8500e-01        Inf       0.050       0.00
    7  16 -6.1000e-02  -0.734   0.377 -2.0250e+00 1.2090e+00       0.050       0.05
    8  17 -4.8000e-02  -0.615   0.560 -1.1160e+00 1.5360e+00       0.050       0.05
    9  13 -5.2000e-02  -0.701   0.464 -9.0600e-01 8.3400e-01       0.050       0.05
   10  14 -1.2400e-01  -0.741   0.734 -3.4900e-01 1.8880e+00       0.049       0.05
   11   4  2.7000e-02   0.510   0.704 -2.7040e+00 9.6500e-01       0.050       0.05
   12   5  1.9003e+14   0.773   0.701 -1.2169e+16 4.3624e+15       0.050       0.05
   13  12 -3.4000e-02  -0.517   0.506 -2.6070e+00 2.7190e+00       0.050       0.05
   14  15  5.2000e-02   0.501   0.175 -8.1600e-01 5.4330e+00       0.000       0.05
   15   7 -1.7000e-02  -0.286   0.743 -4.4600e-01 1.6220e+00       0.050       0.05
   16  11  1.3000e-02   0.192   0.722 -3.9980e+00 1.3000e+00       0.050       0.05
   17  10 -3.1000e-02  -0.318   0.566 -1.2070e+00 1.6720e+00       0.050       0.05
   18  19  1.2000e-02   0.134   0.292 -1.7500e+00 4.8670e+00       0.050       0.05
   19  18  2.2511e+13   0.302   0.411 -1.2799e+15 1.8807e+15       0.050       0.05

Estimated stopping point from ForwardStop rule = 1
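The "Estimated stopping point from ForwardStop rule = 1" comes from the G'Sell et al. rule that fsInf reports: stop at the largest step k whose running mean of -log(1 - p_i) is at most alpha. A minimal sketch (not the package's internal code):

```r
# ForwardStop: khat = max{ k : (1/k) * sum_{i <= k} -log(1 - p_i) <= alpha }
forward_stop <- function(pvals, alpha = 0.1) {
  stat <- cumsum(-log(1 - pvals)) / seq_along(pvals)
  ok <- which(stat <= alpha)
  if (length(ok) == 0) 0L else max(ok)
}

# First three sequential p-values from the fsInf output above
forward_stop(c(0.016, 0.848, 0.089))
# → 1, matching the reported stopping point
```

The large second p-value inflates the running mean past alpha, so only the first step survives.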
# Compute p-values and confidence intervals after AIC stopping - for selected
out.aic <- fsInf(fsfit, type = "aic", sigma = sigmahat)
out.aic

Call:
fsInf(obj = fsfit, sigma = sigmahat, type = "aic")

Standard deviation of noise (specified or estimated) sigma = 0.297

Testing results at step = 5, with alpha = 0.100
 Var   Coef Z-score P-value LowConfPt UpConfPt LowTailArea UpTailArea
   2  0.050   3.019   0.035     0.006    0.103        0.05       0.05
   6  0.091   1.672   0.668    -3.043    1.275        0.05       0.05
   1 -0.017  -2.324   0.125    -0.228    0.025        0.05       0.05
   9 -0.108  -1.695   0.803    -0.281    2.202        0.05       0.05
   3  0.021   1.615   0.580    -1.100    0.743        0.05       0.05

Estimated stopping point from AIC rule = 5
# Compute p-values and confidence intervals after 5 fixed steps
out.fix <- fsInf(fsfit, type = "all", k = 5, sigma = sigmahat)
out.fix

Call:
fsInf(obj = fsfit, sigma = sigmahat, k = 5, type = "all")

Standard deviation of noise (specified or estimated) sigma = 0.297

Testing results at step = 5, with alpha = 0.100
 Var   Coef Z-score P-value LowConfPt UpConfPt LowTailArea UpTailArea
   2  0.050   3.019   0.035     0.006    0.103        0.05       0.05
   6  0.091   1.672   0.668    -3.043    1.275        0.05       0.05
   1 -0.017  -2.324   0.125    -0.228    0.025        0.05       0.05
   9 -0.108  -1.695   0.803    -0.281    2.202        0.05       0.05
   3  0.021   1.615   0.580    -1.100    0.743        0.05       0.05

LASSO - all

Fit a LASSO logistic regression on all available variables.

# Set seed
set.seed(141845)

# Define values
X <- X_all
y <- y_all

# Perform cross-validation
# (alpha = 1 for lasso, alpha = 0 for ridge, 0 < alpha < 1 for elastic net)
cv_fit <- cv.glmnet(
  x = X,
  y = y,
  family = "binomial",
  alpha = 1,
  nfolds = 10
)

# Extract the best lambda
best_lambda_min <- cv_fit$lambda.min
best_lambda_1se <- cv_fit$lambda.1se

# Run glmnet
gfit <- glmnet(
  x = X,
  y = y,
  family = "binomial",
  alpha = 1,
  intercept = TRUE,
  standardize = TRUE
)

# Extract coefficients at the chosen lambda, refitting exactly at that value
# (given the small number of observations, and hence wide intervals, use
# lambda.min rather than lambda.1se)
# NB: the selectiveInference examples extract beta at s = lambda / n and pass
# the unscaled lambda to fixedLassoInf; check the scaling against the vignette
lambda <- best_lambda_min
n <- length(y)
beta <- coef(
  gfit,
  s = lambda,
  exact = TRUE,
  x = X,
  y = y
)

# Estimate sigma
gsigmahat <- estimateSigma(
  x = X,
  y = y,
  intercept = TRUE,
  standardize = TRUE
)$sigmahat

# Compute fixed lambda p-values and selection intervals
out <- fixedLassoInf(
  x = X,
  y = y,
  beta = beta,
  lambda = lambda,
  # Level of significance
  alpha = 0.1,
  family = "binomial",
  intercept = TRUE,
  sigma = gsigmahat
)

# Print model (make this robust to no variable being selected)
if (length(out$vars) > 0) {
  tibble(
    names = names(out$vars),
    odds_ratio = exp(out$coef0),
    pvalue = out$pv,
    ci_lo = exp(out$ci[, 1]),
    ci_hi = exp(out$ci[, 2])
  ) %>%
    arrange(pvalue) %>%
    sable()
} else {
  print("No variables were selected")
}
names odds_ratio pvalue ci_lo ci_hi
modified_rankin_score_pretreatment 1.66779 0.00951 1.18922 2.5456
age_at_first_treatment_yrs 0.83280 0.10778 0.72823 1.0705
is_spetzler_martin_grade_less_than_4TRUE 0.43027 0.14865 0.00012 3.7943
has_seizuresTRUE 0.18653 0.23048 0.05530 7.0961
locationDeep 2.86738 0.25396 0.12200 142.9458
is_eloquent_locationTRUE 1.64450 0.75755 0.00009 4.7487
is_diffuse_nidusTRUE 1.56935 0.89756 0.00000 2.3155

LASSO - without scores

Fit a LASSO logistic regression using the score components rather than the composite gradings.

# Set seed
set.seed(141845)

# Define values
X <- X_without_gradings
y <- y_without_gradings

# Perform cross-validation
# (alpha = 1 for lasso, alpha = 0 for ridge, 0 < alpha < 1 for elastic net)
cv_fit <- cv.glmnet(
  x = X,
  y = y,
  family = "binomial",
  alpha = 1,
  nfolds = 10
)

# Extract the best lambda
best_lambda_min <- cv_fit$lambda.min
best_lambda_1se <- cv_fit$lambda.1se

# Run glmnet
gfit <- glmnet(
  x = X,
  y = y,
  family = "binomial",
  alpha = 1,
  intercept = TRUE,
  standardize = TRUE
)

# Extract coefficients at the chosen lambda, refitting exactly at that value
# (given the small number of observations, and hence wide intervals, use
# lambda.min rather than lambda.1se)
# NB: the selectiveInference examples extract beta at s = lambda / n and pass
# the unscaled lambda to fixedLassoInf; check the scaling against the vignette
lambda <- best_lambda_min
n <- length(y)
beta <- coef(
  gfit,
  s = lambda,
  exact = TRUE,
  x = X,
  y = y
)

# Estimate sigma
gsigmahat <- estimateSigma(
  x = X,
  y = y,
  intercept = TRUE,
  standardize = TRUE
)$sigmahat

# Compute fixed lambda p-values and selection intervals
out <- fixedLassoInf(
  x = X,
  y = y,
  beta = beta,
  lambda = lambda,
  # Level of significance
  alpha = 0.1,
  family = "binomial",
  intercept = TRUE,
  sigma = gsigmahat
)

# Print model (make this robust to no variable being selected)
if (length(out$vars) > 0) {
  tibble(
    names = names(out$vars),
    odds_ratio = exp(out$coef0),
    pvalue = out$pv,
    ci_lo = exp(out$ci[, 1]),
    ci_hi = exp(out$ci[, 2])
  ) %>%
    arrange(pvalue) %>%
    sable()
} else {
  print("No variables were selected")
}
names odds_ratio pvalue ci_lo ci_hi
age_at_first_treatment_yrs 0.84408 0.09042 0.47321 1.0635e+00
modified_rankin_score_pretreatment 1.65394 0.12071 0.75732 5.4063e+00
is_eloquent_locationTRUE 2.65386 0.16886 0.26928 4.7811e+02
has_seizuresTRUE 0.27239 0.26889 0.00251 1.6292e+01
is_diffuse_nidusTRUE 1.56820 0.42228 0.00015 1.4292e+05
max_size_cm 1.11306 0.54153 0.33169 2.2316e+00
procedure_combinationsERS 1.77744 0.82187 0.00000 2.2410e+01
procedure_combinationsES 0.47431 0.93123 0.53462 1.1049e+26

LASSO - without components

Fit a LASSO logistic regression using the composite gradings rather than their components.

# Set seed
set.seed(141845)

# Define values
X <- X_with_gradings
y <- y_with_gradings

# Perform cross-validation
# (alpha = 1 for lasso, alpha = 0 for ridge, 0 < alpha < 1 for elastic net)
cv_fit <- cv.glmnet(
  x = X,
  y = y,
  family = "binomial",
  alpha = 1,
  nfolds = 10
)

# Extract the best lambda
best_lambda_min <- cv_fit$lambda.min
best_lambda_1se <- cv_fit$lambda.1se

# Run glmnet
gfit <- glmnet(
  x = X,
  y = y,
  family = "binomial",
  alpha = 1,
  intercept = TRUE,
  standardize = TRUE
)

# Extract coefficients at the chosen lambda, refitting exactly at that value
# (given the small number of observations, and hence wide intervals, use
# lambda.min rather than lambda.1se)
# NB: the selectiveInference examples extract beta at s = lambda / n and pass
# the unscaled lambda to fixedLassoInf; check the scaling against the vignette
lambda <- best_lambda_min
n <- length(y)
beta <- coef(
  gfit,
  s = lambda,
  exact = TRUE,
  x = X,
  y = y
)

# Estimate sigma
gsigmahat <- estimateSigma(
  x = X,
  y = y,
  intercept = TRUE,
  standardize = TRUE
)$sigmahat

# Compute fixed lambda p-values and selection intervals
out <- fixedLassoInf(
  x = X,
  y = y,
  beta = beta,
  lambda = lambda,
  # Level of significance
  alpha = 0.1,
  family = "binomial",
  intercept = TRUE,
  sigma = gsigmahat
)

# Print model (make this robust to no variable being selected)
if (length(out$vars) > 0) {
  tibble(
    names = names(out$vars),
    odds_ratio = exp(out$coef0),
    pvalue = out$pv,
    ci_lo = exp(out$ci[, 1]),
    ci_hi = exp(out$ci[, 2])
  ) %>%
    arrange(pvalue) %>%
    sable()
} else {
  print("No variables were selected")
}
names odds_ratio pvalue ci_lo ci_hi
modified_rankin_score_pretreatment 1.76041 0.04550 1.02081 6.9717e+00
age_at_first_treatment_yrs 0.81386 0.05071 0.50653 1.0026e+00
spetzler_martin_grade 1.56678 0.07427 0.80658 1.0833e+03
locationDeep 2.78545 0.20600 0.19579 2.2890e+02
procedure_combinationsERS 1.71038 0.47244 0.00000 1.5773e+07
has_seizuresTRUE 0.18650 0.51027 0.01669 1.5061e+03
is_diffuse_nidusTRUE 1.13422 0.87622 0.00000 2.4812e+01
procedure_combinationsES 0.45126 0.88005 0.20331 1.8598e+15

Multivariable - Without scores

Feature selection

Create predictors.

# Define p-value threshold
pval_threshold <- 0.2

# Get the predictors of interest
predictors <-
  univariable_adjusted_recoded %>%
  bind_rows(univariable_unadjusted_recoded) %>%
  filter(`P-values` < pval_threshold) %>%
  pull(`Predictors`)

# Clean categorical variables
predictors <-
  predictors %>%
  str_replace("^([a-z0-9_]*)([A-Z].*)$", "\\1") %>%
  unique()

# De-duplicate the predictor list (no extra adjustment variables are added)
predictors <- unique(predictors)

# Remove unwanted predictors
predictors <- predictors[!predictors %in% unwanted_without_gradings]

# Print
print(predictors)
[1] "modified_rankin_score_pretreatment" "is_eloquent_location"              
[3] "is_diffuse_nidus"                   "max_size_cm"                       
[5] "procedure_combinations"             "has_seizures"                      
[7] "age_at_first_treatment_yrs"        
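The str_replace cleanup relies on the fact that the factor-level suffixes appended by the model (e.g. `Deep`, `TRUE`) start with an uppercase letter while the snake_case variable names do not. A base-R equivalent of the same pattern:

```r
# Strip the factor-level suffix from model terms; purely snake_case names
# do not match the pattern and pass through unchanged
terms <- c("locationDeep", "has_seizuresTRUE", "max_size_cm")
sub("^([a-z0-9_]*)([A-Z].*)$", "\\1", terms)
# → "location" "has_seizures" "max_size_cm"
```

This collapses all dummy levels of a categorical variable back to the single column name before `unique()` is applied.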

Model selection

Fit logistic.

# Define data
df <- df_multi %>% drop_na(modified_rankin_score_pretreatment, location)

# Create formula
model <- as.formula(paste(
  CHOSEN_OUTCOME,
  "~",
  paste(predictors, collapse = " + ")
))

fit <- glm(
  model,
  data = df,
  family = binomial()
)

# Save
fit_without_grading <- fit

Print results.

# Summarize model coefficients
fit_summary <-
  fit %>%
  broom::tidy(exponentiate = T, conf.int = T) %>%
  arrange(term == "(Intercept)", p.value) %>%
  rename(odds_ratio = estimate, z_value = statistic)

# Stylize and print
stylized <-
  fit_summary %>%
  rename(
    "Predictors" = term,
    "Odds Ratios (OR)" = odds_ratio,
    "SE" = std.error,
    "Z-scores" = z_value,
    "P-values" = p.value,
    "CI (low)" = conf.low,
    "CI (high)" = conf.high,
  ) %>%
  mutate(
    `Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
    `SE` = round(`SE`, 2),
    `CI (low)` = round(`CI (low)`, 2),
    `CI (high)` = round(`CI (high)`, 2),
    `P-values` = round(`P-values`, 3),
  )

# Print
stylized %>% sable()
Predictors Odds Ratios (OR) SE Z-scores P-values CI (low) CI (high)
modified_rankin_score_pretreatment 1.74 0.21 2.63949 0.008 1.17 2.72
age_at_first_treatment_yrs 0.81 0.09 -2.17419 0.030 0.67 0.97
is_eloquent_locationTRUE 4.54 0.82 1.83805 0.066 1.01 27.54
has_seizuresTRUE 0.14 1.25 -1.56033 0.119 0.01 1.09
procedure_combinationsR 0.10 1.72 -1.34998 0.177 0.00 3.88
procedure_combinationsS 0.16 1.41 -1.30445 0.192 0.01 4.27
procedure_combinationsMultimodal 0.18 1.37 -1.27121 0.204 0.01 4.55
is_diffuse_nidusTRUE 2.48 0.94 0.96548 0.334 0.37 16.42
max_size_cm 1.07 0.14 0.49016 0.624 0.79 1.43
(Intercept) 0.42 1.72 -0.50353 0.615 0.01 10.21

Inference

Create robust CIs using the sandwich estimator.

# Calculate robust standard errors
robust_se <- sandwich::vcovHC(fit, type = "HC0")

# Compute robust confidence intervals
robust_ci <- lmtest::coeftest(fit, vcov. = robust_se)
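As a sanity check on what `sandwich::vcovHC(fit, type = "HC0")` returns for a logistic model: the "bread" is the inverse Fisher information and the "meat" is the outer product of per-observation score contributions. A base-R sketch on toy data (`mtcars`, not the AVM cohort):

```r
# HC0 sandwich variance assembled by hand for a logistic GLM (toy data)
fit_toy <- glm(am ~ wt + hp, data = mtcars, family = binomial())
X <- model.matrix(fit_toy)
p <- fitted(fit_toy)
bread <- solve(t(X) %*% (X * (p * (1 - p))))  # inverse Fisher information
u <- X * (mtcars$am - p)                      # per-observation score rows
meat <- t(u) %*% u                            # outer product of scores
vcov_hc0 <- bread %*% meat %*% bread
robust_se <- sqrt(diag(vcov_hc0))
```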

Stylize results.

# Summarize model coefficients
fit_summary <-
  robust_ci %>%
  # broom does not support exponentiation after coeftest so do it manually
  broom::tidy(exponentiate = F, conf.int = T) %>%
  mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
  arrange(term == "(Intercept)", p.value) %>%
  rename(odds_ratio = estimate, z_value = statistic)

# Stylize and print
multivariable_pvalue <-
  fit_summary %>%
  rename(
    "Predictors" = term,
    "Odds Ratios (OR)" = odds_ratio,
    "SE" = std.error,
    "Z-scores" = z_value,
    "P-values" = p.value,
    "CI (low)" = conf.low,
    "CI (high)" = conf.high,
  ) %>%
  mutate(
    `Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
    `SE` = round(`SE`, 2),
    `CI (low)` = round(`CI (low)`, 2),
    `CI (high)` = round(`CI (high)`, 2),
    `P-values` = round(`P-values`, 3),
  )

# Print
multivariable_pvalue %>% sable()
Predictors Odds Ratios (OR) SE Z-scores P-values CI (low) CI (high)
modified_rankin_score_pretreatment 1.74 0.20 2.84769 0.004 1.19 2.56
has_seizuresTRUE 0.14 0.93 -2.11277 0.035 0.02 0.87
is_eloquent_locationTRUE 4.54 0.74 2.04600 0.041 1.07 19.31
age_at_first_treatment_yrs 0.81 0.10 -2.02801 0.043 0.67 0.99
procedure_combinationsS 0.16 1.56 -1.17922 0.238 0.01 3.37
procedure_combinationsR 0.10 1.99 -1.16653 0.243 0.00 4.86
procedure_combinationsMultimodal 0.18 1.56 -1.11405 0.265 0.01 3.74
is_diffuse_nidusTRUE 2.48 0.85 1.07241 0.284 0.47 13.10
max_size_cm 1.07 0.14 0.52187 0.602 0.82 1.40
(Intercept) 0.42 2.05 -0.42369 0.672 0.01 23.17

Plot the coefficients with their confidence intervals.

robust_ci %>%
  # broom does not support exponentiation after coeftest so do it manually
  broom::tidy(exponentiate = F, conf.int = T) %>%
  mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
  filter(term != "(Intercept)") %>%
  arrange(desc(estimate)) %>%
  mutate(term = factor(term, levels = term)) %>%
  ggplot(aes(x = estimate, y = term, color = term)) +
  geom_pointrange(aes(xmin = conf.low, xmax = conf.high)) +
  geom_vline(xintercept = 1, linetype = 2, alpha = 0.5) +
  labs(x = "Odds ratio", y = NULL) +
  guides(color = "none") +
  theme(
    axis.text = element_text(size = 12),
    axis.title = element_text(size = 13)
  )

# Dotwhisker was not giving me the correct plot
# dotwhisker::dwplot(
#   vline = geom_vline(xintercept = 1, linetype = 2, alpha =  0.5),
#   show_intercept = F
# )

Predict

Predict.

# Predicted probabilities
predicted_probs <- predict(fit, type = "response")
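`type = "response"` is simply the inverse-logit of the linear predictor; the equivalence can be checked on a toy model (`mtcars`, not the AVM data):

```r
# type = "response" equals plogis() of type = "link"
fit_toy <- glm(vs ~ mpg, data = mtcars, family = binomial())
eta  <- predict(fit_toy, type = "link")       # linear predictor (log-odds)
prob <- predict(fit_toy, type = "response")   # fitted probabilities
isTRUE(all.equal(unname(prob), unname(plogis(eta))))  # → TRUE
```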

Plot histogram of predictions.

tibble(
  Predictions = predicted_probs,
  Truth = fit$model[, CHOSEN_OUTCOME]
) %>%
  pivot_longer(
    cols = everything(),
    names_to = "keys",
    values_to = "values"
  ) %>%
  mutate(keys = fct_inorder(keys)) %>%
  ggplot(aes(x = values, color = keys, fill = keys)) +
  geom_histogram(alpha = 0.7) +
  geom_vline(xintercept = 0.5, linetype = 2, alpha = 0.5) +
  facet_wrap(vars(keys), ncol = 1, scales = "fixed") +
  theme(
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    axis.text = element_text(size = 12),
    strip.text.x = element_text(size = 13),
    legend.position = "none"
  )

Evaluate

How many examples were used?

# All examples
num_all_examples <- nrow(fit$data)
num_all_examples_positive <- sum(fit$data[, CHOSEN_OUTCOME], na.rm = T)
num_complete_cases <- nrow(fit$model)
num_complete_cases_positive <- sum(fit$model[, CHOSEN_OUTCOME], na.rm = T)

# Print
sprintf(
  "
  All examples = %s (%s with outcome)
  Fitted examples = %s (%s with outcome)",
  num_all_examples,
  num_all_examples_positive,
  num_complete_cases,
  num_complete_cases_positive
) %>% cat()

  All examples = 135 (15 with outcome)
  Fitted examples = 135 (15 with outcome)

Residual analysis.

# Residuals
residuals <- residuals(fit, type = "deviance")

# Plot residuals
plot(residuals)

Pseudo R-squared.

# Calculate Pseudo R-squared values
pseudo_r2 <- pscl::pR2(fit)
fitting null model for pseudo-r2
print(pseudo_r2)
      llh   llhNull        G2  McFadden      r2ML      r2CU 
-34.24172 -47.09233  25.70123   0.27288   0.17335   0.34515 
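The printed components are linked by simple identities: McFadden's R² = 1 − llh/llhNull and G² = 2(llh − llhNull). Recomputing from the values above:

```r
# McFadden's R^2 and G^2 from the log-likelihoods printed above
llh      <- -34.24172  # fitted model
llh_null <- -47.09233  # intercept-only model
mcfadden <- 1 - llh / llh_null
g2       <- 2 * (llh - llh_null)
round(c(McFadden = mcfadden, G2 = g2), 5)
# → McFadden 0.27288, G2 25.70122 (matches the output up to rounding)
```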

ROC-AUC.

# Compute
auroc <- cvAUC::ci.cvAUC(
  predictions = predicted_probs,
  labels = fit$model[, CHOSEN_OUTCOME]
)

# Print
sprintf(
  "Cross-validated AUROC = %.3f (SE, %.3f; %s%% CI, %.3f-%.3f)",
  auroc$cvAUC,
  auroc$se,
  auroc$confidence * 100,
  auroc$ci[1],
  auroc$ci[2]
)
[1] "Cross-validated AUROC = 0.858 (SE, 0.039; 95% CI, 0.782-0.935)"

PR-AUC.

# Construct object
precrec_obj_tr <- precrec::evalmod(
  scores = predicted_probs,
  labels = fit$model[, CHOSEN_OUTCOME],
  calc_avg = F,
  cb_alpha = 0.05
)

# Print
# NOTE: Can use precrec::auc_ci to get CIs, but need to have multiple datasets
precrec_obj_tr %>%
  precrec::part() %>%
  precrec::pauc() %>%
  sable()
modnames dsids curvetypes paucs spaucs
m1 1 ROC 0.85833 0.85833
m1 1 PRC 0.46874 0.46874

ROC and PR curves.

# Plot
precrec_obj_tr %>% autoplot()

Another version of the ROC curve.

# Calculate ROC curve
roc_curve <- pROC::roc(fit$model[, CHOSEN_OUTCOME] ~ predicted_probs)

# Plot ROC curve
plot(roc_curve)

# Calculate AUC
auc <- pROC::auc(roc_curve)
print(auc)
Area under the curve: 0.858

Plot all performance measures.

# Construct object
precrec_obj_tr2 <- precrec::evalmod(
  scores = predicted_probs,
  labels = fit$model[, CHOSEN_OUTCOME],
  mode = "basic"
)

# Plot
precrec_obj_tr2 %>% autoplot()

Confusion matrix.

Kappa statistic: 0 = No agreement; 0.1-0.2 = Slight agreement; 0.2-0.4 = Fair agreement; 0.4-0.6 = Moderate agreement; 0.6-0.8 = Substantial agreement; 0.8-1.0 = Near-perfect agreement.

No Information Rate (NIR): the proportion of the largest observed class. For example, if 23 of 35 patients had the outcome, that would be the largest class and the NIR would be 23/35 = 0.657. The further Accuracy exceeds the NIR, the more confident we can be that the model is not simply predicting the largest class.

# Scores
pred <- factor(ifelse(predicted_probs > 0.5, "Event", "No Event"))
truth <- factor(ifelse(fit$model[, CHOSEN_OUTCOME], "Event", "No Event"))

# Count values per category
xtab <- table(
  scores = pred,
  labels = truth
)

# Compute
confusion_matrix <-
  caret::confusionMatrix(
    xtab,
    positive = "Event",
    prevalence = NULL, # Provide prevalence as a proportion here if you want
    mode = "sens_spec"
  )

# Print
confusion_matrix
Confusion Matrix and Statistics

          labels
scores     Event No Event
  Event        3        2
  No Event    12      118
                                        
               Accuracy : 0.896         
                 95% CI : (0.832, 0.942)
    No Information Rate : 0.889         
    P-Value [Acc > NIR] : 0.4596        
                                        
                  Kappa : 0.259         
                                        
 Mcnemar's Test P-Value : 0.0162        
                                        
            Sensitivity : 0.2000        
            Specificity : 0.9833        
         Pos Pred Value : 0.6000        
         Neg Pred Value : 0.9077        
             Prevalence : 0.1111        
         Detection Rate : 0.0222        
   Detection Prevalence : 0.0370        
      Balanced Accuracy : 0.5917        
                                        
       'Positive' Class : Event         
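
The headline statistics above can be recovered by hand from the 2×2 counts (TP = 3, FP = 2, FN = 12, TN = 118), which also makes the Kappa and NIR definitions concrete:

```r
# Recover the caret summary statistics from the 2x2 counts printed above
tp <- 3; fp <- 2; fn <- 12; tn <- 118
n <- tp + fp + fn + tn
accuracy    <- (tp + tn) / n
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
nir         <- max(tp + fn, fp + tn) / n                       # largest class
p_exp <- ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n^2 # chance agreement
kappa <- (accuracy - p_exp) / (1 - p_exp)                      # Cohen's kappa
round(c(acc = accuracy, sens = sensitivity, spec = specificity,
        nir = nir, kappa = kappa), 3)
# → acc 0.896, sens 0.200, spec 0.983, nir 0.889, kappa 0.259
```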
                                        

Variable importance

Plot importance (based on magnitude of z-score).

# Calculate importance
varimp <-
  fit %>%
  caret::varImp() %>%
  as_tibble(rownames = "rn") %>%
  mutate(Predictor = factor(rn) %>% fct_reorder(Overall)) %>%
  dplyr::select(Predictor, Importance = Overall)

# Plot variable importance
varimp %>%
  ggplot(aes(Predictor, Importance, fill = Importance)) +
  geom_col(alpha = 0.9, width = 0.65) +
  coord_flip() +
  guides(fill = "none") +
  labs(y = "\nVariable Importance", x = NULL) +
  scale_fill_viridis_c() +
  theme(
    panel.grid.major.y = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.border = element_blank(),
    axis.text.x = element_text(margin = margin(t = 7), size = 12),
    axis.text.y = element_text(margin = margin(r = -10), size = 12),
    axis.title = element_text(size = 13)
  )
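For a `glm`, `caret::varImp()` as used above is essentially the absolute z-statistic of each coefficient, so the same ranking can be read off `summary()` alone. A toy sketch on `mtcars` (not the AVM data):

```r
# Importance for a glm as |z|, intercept excluded
fit_toy <- glm(vs ~ mpg + wt, data = mtcars, family = binomial())
z <- summary(fit_toy)$coefficients[, "z value"]
importance <- sort(abs(z[names(z) != "(Intercept)"]), decreasing = TRUE)
print(importance)
```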


Multivariable - Without components

Feature selection

Create predictors.

# Define p-value threshold
pval_threshold <- 0.2

# Get the predictors of interest
predictors <-
  univariable_unadjusted_recoded %>%
  bind_rows(univariable_adjusted_recoded) %>%
  filter(`P-values` < pval_threshold) %>%
  pull(`Predictors`)

# Clean categorical variables
predictors <-
  predictors %>%
  str_replace("^([a-z0-9_]*)([A-Z].*)$", "\\1") %>%
  unique()

# Add adjustment variables (none at present; step kept for pipeline consistency)
predictors <- unique(c(predictors))

# Remove unwanted predictors
predictors <- predictors[!predictors %in% unwanted_with_gradings]

# Print
print(predictors)
[1] "modified_rankin_score_pretreatment" "spetzler_martin_grade"             
[3] "age_at_first_treatment_yrs"         "is_diffuse_nidus"                  
[5] "has_seizures"                       "location"                          
[7] "procedure_combinations"            

Model selection

Fit logistic.

# Define data
df <- df_multi

# Create formula
model <- as.formula(paste(
  CHOSEN_OUTCOME,
  "~",
  paste(predictors, collapse = " + ")
))

fit <- glm(
  model,
  data = df,
  family = binomial()
)

fit_with_grading <- fit

Print results.

# Summarize model coefficients
fit_summary <-
  fit %>%
  broom::tidy(exponentiate = T, conf.int = T) %>%
  arrange(term == "(Intercept)", p.value) %>%
  rename(odds_ratio = estimate, z_value = statistic)

# Stylize and print
stylized <-
  fit_summary %>%
  rename(
    "Predictors" = term,
    "Odds Ratios (OR)" = odds_ratio,
    "SE" = std.error,
    "Z-scores" = z_value,
    "P-values" = p.value,
    "CI (low)" = conf.low,
    "CI (high)" = conf.high,
  ) %>%
  mutate(
    `Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
    `SE` = round(`SE`, 2),
    `CI (low)` = round(`CI (low)`, 2),
    `CI (high)` = round(`CI (high)`, 2),
    `P-values` = round(`P-values`, 3),
  )

# Print
stylized %>% sable()
Predictors Odds Ratios (OR) SE Z-scores P-values CI (low) CI (high)
modified_rankin_score_pretreatment 1.88 0.24 2.67204 0.008 1.22 3.15
age_at_first_treatment_yrs 0.76 0.11 -2.53186 0.011 0.61 0.93
has_seizuresTRUE 0.07 1.45 -1.84460 0.065 0.00 0.74
spetzler_martin_grade 2.05 0.42 1.69791 0.090 0.91 4.95
locationDeep 6.15 1.16 1.56383 0.118 0.74 76.27
procedure_combinationsMultimodal 0.11 1.52 -1.47477 0.140 0.01 3.25
procedure_combinationsR 0.07 1.79 -1.46614 0.143 0.00 3.12
procedure_combinationsS 0.25 1.51 -0.91818 0.359 0.01 7.55
is_diffuse_nidusTRUE 1.48 0.96 0.41098 0.681 0.21 9.91
locationHemispheric 1.20 0.98 0.18993 0.849 0.19 9.94
(Intercept) 0.19 2.00 -0.82971 0.407 0.00 7.21

Inference

Create robust CIs using the sandwich estimator.

# Calculate robust standard errors
robust_se <- sandwich::vcovHC(fit, type = "HC0")

# Compute robust confidence intervals
robust_ci <- lmtest::coeftest(fit, vcov. = robust_se)

Stylize results.

# Summarize model coefficients
fit_summary <-
  robust_ci %>%
  # broom does not support exponentiation after coeftest so do it manually
  broom::tidy(exponentiate = F, conf.int = T) %>%
  mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
  arrange(term == "(Intercept)", p.value) %>%
  rename(odds_ratio = estimate, z_value = statistic)

# Stylize and print
multivariable_pvalue <-
  fit_summary %>%
  rename(
    "Predictors" = term,
    "Odds Ratios (OR)" = odds_ratio,
    "SE" = std.error,
    "Z-scores" = z_value,
    "P-values" = p.value,
    "CI (low)" = conf.low,
    "CI (high)" = conf.high,
  ) %>%
  mutate(
    `Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
    `SE` = round(`SE`, 2),
    `CI (low)` = round(`CI (low)`, 2),
    `CI (high)` = round(`CI (high)`, 2),
    `P-values` = round(`P-values`, 3),
  )

# Print
multivariable_pvalue %>% sable()
Predictors Odds Ratios (OR) SE Z-scores P-values CI (low) CI (high)
modified_rankin_score_pretreatment 1.88 0.21 2.95199 0.003 1.24 2.86
age_at_first_treatment_yrs 0.76 0.11 -2.44907 0.014 0.61 0.95
has_seizuresTRUE 0.07 1.10 -2.43244 0.015 0.01 0.59
locationDeep 6.15 1.14 1.59388 0.111 0.66 57.32
spetzler_martin_grade 2.05 0.52 1.38854 0.165 0.74 5.66
procedure_combinationsMultimodal 0.11 1.68 -1.33442 0.182 0.00 2.86
procedure_combinationsR 0.07 2.09 -1.26125 0.207 0.00 4.29
procedure_combinationsS 0.25 1.46 -0.94616 0.344 0.01 4.40
is_diffuse_nidusTRUE 1.48 0.89 0.44201 0.658 0.26 8.53
locationHemispheric 1.20 0.95 0.19697 0.844 0.19 7.69
(Intercept) 0.19 2.65 -0.62758 0.530 0.00 34.16

Plot the coefficients with their confidence intervals.

robust_ci %>%
  # broom does not support exponentiation after coeftest so do it manually
  broom::tidy(exponentiate = F, conf.int = T) %>%
  mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
  filter(term != "(Intercept)") %>%
  arrange(desc(estimate)) %>%
  mutate(term = factor(term, levels = term)) %>%
  ggplot(aes(x = estimate, y = term, color = term)) +
  geom_pointrange(aes(xmin = conf.low, xmax = conf.high)) +
  geom_vline(xintercept = 1, linetype = 2, alpha = 0.5) +
  labs(x = "Odds ratio", y = NULL) +
  guides(color = "none") +
  theme(
    axis.text = element_text(size = 12),
    axis.title = element_text(size = 13)
  )

# Dotwhisker was not giving me the correct plot
# dotwhisker::dwplot(
#   vline = geom_vline(xintercept = 1, linetype = 2, alpha =  0.5),
#   show_intercept = F
# )

Predict

Predict.

# Predicted probabilities
predicted_probs <- predict(fit, type = "response")

Plot histogram of predictions.

tibble(
  Predictions = predicted_probs,
  Truth = fit$model[, CHOSEN_OUTCOME]
) %>%
  pivot_longer(
    cols = everything(),
    names_to = "keys",
    values_to = "values"
  ) %>%
  mutate(keys = fct_inorder(keys)) %>%
  ggplot(aes(x = values, color = keys, fill = keys)) +
  geom_histogram(alpha = 0.7) +
  geom_vline(xintercept = 0.5, linetype = 2, alpha = 0.5) +
  facet_wrap(vars(keys), ncol = 1, scales = "fixed") +
  theme(
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    axis.text = element_text(size = 12),
    strip.text.x = element_text(size = 13),
    legend.position = "none"
  )

Evaluate

How many examples were used?

# All examples
num_all_examples <- nrow(fit$data)
num_all_examples_positive <- sum(fit$data[, CHOSEN_OUTCOME], na.rm = T)
num_complete_cases <- nrow(fit$model)
num_complete_cases_positive <- sum(fit$model[, CHOSEN_OUTCOME], na.rm = T)

# Print
sprintf(
  "
  All examples = %s (%s with outcome)
  Fitted examples = %s (%s with outcome)",
  num_all_examples,
  num_all_examples_positive,
  num_complete_cases,
  num_complete_cases_positive
) %>% cat()

  All examples = 137 (15 with outcome)
  Fitted examples = 135 (15 with outcome)

Residual analysis.

# Residuals
residuals <- residuals(fit, type = "deviance")

# Plot residuals
plot(residuals)

Pseudo R-squared.

# Calculate Pseudo R-squared values
pseudo_r2 <- pscl::pR2(fit)
fitting null model for pseudo-r2
print(pseudo_r2)
      llh   llhNull        G2  McFadden      r2ML      r2CU 
-32.05342 -47.09233  30.07783   0.31935   0.19972   0.39766 

ROC-AUC.

# Compute
auroc <- cvAUC::ci.cvAUC(
  predictions = predicted_probs,
  labels = fit$model[, CHOSEN_OUTCOME]
)

# Print
sprintf(
  "Cross-validated AUROC = %.3f (SE, %.3f; %s%% CI, %.3f-%.3f)",
  auroc$cvAUC,
  auroc$se,
  auroc$confidence * 100,
  auroc$ci[1],
  auroc$ci[2]
)
[1] "Cross-validated AUROC = 0.888 (SE, 0.036; 95% CI, 0.818-0.958)"

PR-AUC.

# Construct object
precrec_obj_tr <- precrec::evalmod(
  scores = predicted_probs,
  labels = fit$model[, CHOSEN_OUTCOME],
  calc_avg = F,
  cb_alpha = 0.05
)

# Print
# NOTE: Can use precrec::auc_ci to get CIs, but need to have multiple datasets
precrec_obj_tr %>%
  precrec::part() %>%
  precrec::pauc() %>%
  sable()
modnames dsids curvetypes paucs spaucs
m1 1 ROC 0.88778 0.88778
m1 1 PRC 0.54014 0.54014

ROC and PR curves.

# Plot
precrec_obj_tr %>% autoplot()

Another version of the ROC curve.

# Calculate ROC curve
roc_curve <- pROC::roc(fit$model[, CHOSEN_OUTCOME] ~ predicted_probs)

# Plot ROC curve
plot(roc_curve)

# Calculate AUC
auc <- pROC::auc(roc_curve)
print(auc)
Area under the curve: 0.888

Plot all performance measures.

# Construct object
precrec_obj_tr2 <- precrec::evalmod(
  scores = predicted_probs,
  labels = fit$model[, CHOSEN_OUTCOME],
  mode = "basic"
)

# Plot
precrec_obj_tr2 %>% autoplot()

Confusion matrix.

Kappa statistic: 0 = No agreement; 0.1-0.2 = Slight agreement; 0.2-0.4 = Fair agreement; 0.4-0.6 = Moderate agreement; 0.6-0.8 = Substantial agreement; 0.8-1.0 = Near-perfect agreement.

No Information Rate (NIR): the proportion of the largest observed class. For example, if 23 of 35 patients had the outcome, that would be the largest class and the NIR would be 23/35 = 0.657. The further Accuracy exceeds the NIR, the more confident we can be that the model is not simply predicting the largest class.

# Scores
pred <- factor(ifelse(predicted_probs > 0.5, "Event", "No Event"))
truth <- factor(ifelse(fit$model[, CHOSEN_OUTCOME], "Event", "No Event"))

# Count values per category
xtab <- table(
  scores = pred,
  labels = truth
)

# Compute
confusion_matrix <-
  caret::confusionMatrix(
    xtab,
    positive = "Event",
    prevalence = NULL, # Provide prevalence as a proportion here if you want
    mode = "sens_spec"
  )

# Print
confusion_matrix
Confusion Matrix and Statistics

          labels
scores     Event No Event
  Event        3        2
  No Event    12      118
                                        
               Accuracy : 0.896         
                 95% CI : (0.832, 0.942)
    No Information Rate : 0.889         
    P-Value [Acc > NIR] : 0.4596        
                                        
                  Kappa : 0.259         
                                        
 Mcnemar's Test P-Value : 0.0162        
                                        
            Sensitivity : 0.2000        
            Specificity : 0.9833        
         Pos Pred Value : 0.6000        
         Neg Pred Value : 0.9077        
             Prevalence : 0.1111        
         Detection Rate : 0.0222        
   Detection Prevalence : 0.0370        
      Balanced Accuracy : 0.5917        
                                        
       'Positive' Class : Event         
                                        

Variable importance

Plot importance (based on magnitude of z-score).

# Calculate importance
varimp <-
  fit %>%
  caret::varImp() %>%
  as_tibble(rownames = "rn") %>%
  mutate(Predictor = factor(rn) %>% fct_reorder(Overall)) %>%
  dplyr::select(Predictor, Importance = Overall)

# Plot variable importance
varimp %>%
  ggplot(aes(Predictor, Importance, fill = Importance)) +
  geom_col(alpha = 0.9, width = 0.65) +
  coord_flip() +
  guides(fill = "none") +
  labs(y = "\nVariable Importance", x = NULL) +
  scale_fill_viridis_c() +
  theme(
    panel.grid.major.y = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.border = element_blank(),
    axis.text.x = element_text(margin = margin(t = 7), size = 12),
    axis.text.y = element_text(margin = margin(r = -10), size = 12),
    axis.title = element_text(size = 13)
  )


Multivariable - With interactions

Presents with hemorrhage

Fit logistic.

# Define data
df <- df_multi

# Create formula
model <- as.formula(paste(
  CHOSEN_OUTCOME,
  "~",
  "modified_rankin_score_pretreatment*has_hemorrhage +
  spetzler_martin_grade*has_hemorrhage +
  age_at_first_treatment_yrs*has_hemorrhage"
))

fit <- glm(
  model,
  data = df,
  family = binomial()
)
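In R's formula language, `a*b` expands to the main effects plus their interaction, which is why the fitted model below reports a `has_hemorrhage` main effect alongside three interaction terms:

```r
# In a formula, `a * b` is shorthand for `a + b + a:b`
attr(terms(y ~ a * b), "term.labels")
# → "a" "b" "a:b"
```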

Print results.

# Summarize model coefficients
fit_summary <-
  fit %>%
  broom::tidy(exponentiate = T, conf.int = T) %>%
  arrange(term == "(Intercept)", p.value) %>%
  rename(odds_ratio = estimate, z_value = statistic)

# Stylize and print
stylized <-
  fit_summary %>%
  rename(
    "Predictors" = term,
    "Odds Ratios (OR)" = odds_ratio,
    "SE" = std.error,
    "Z-scores" = z_value,
    "P-values" = p.value,
    "CI (low)" = conf.low,
    "CI (high)" = conf.high,
  ) %>%
  mutate(
    `Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
    `SE` = round(`SE`, 2),
    `CI (low)` = round(`CI (low)`, 2),
    `CI (high)` = round(`CI (high)`, 2),
    `P-values` = round(`P-values`, 3),
  )

# Print
stylized %>% sable()
Predictors Odds Ratios (OR) SE Z-scores P-values CI (low) CI (high)
modified_rankin_score_pretreatment 1.81 0.20 3.0036 0.003 1.25 2.75
spetzler_martin_grade 2.07 0.28 2.5986 0.009 1.23 3.76
age_at_first_treatment_yrs 0.82 0.09 -2.1075 0.035 0.68 0.98
has_hemorrhageTRUE
modified_rankin_score_pretreatment:has_hemorrhageTRUE
has_hemorrhageTRUE:spetzler_martin_grade
has_hemorrhageTRUE:age_at_first_treatment_yrs
(Intercept) 0.02 1.38 -2.7271 0.006 0.00 0.29

The has_hemorrhage main effect and its interaction terms come back without estimates (NA coefficients), most likely because hemorrhage status is aliased with, or perfectly separates, the other terms in these data.

Model comparison

Likelihood ratio

Based on the likelihood ratio test, there is statistical evidence that the grading-based model fits the data better than the grading-independent model (p = 0.036).

anova(fit_without_grading, fit_with_grading, test = "LR")
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
125 68.483
124 64.107 1 4.3766 0.03644
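The `anova()` p-value is just the upper χ² tail of the deviance drop, with df equal to the difference in parameters. Recomputing from the printed values:

```r
# The LRT p-value is the upper chi-squared tail of the deviance drop
deviance_drop <- 4.3766  # printed Deviance above, 1 extra parameter
p_value <- pchisq(deviance_drop, df = 1, lower.tail = FALSE)
round(p_value, 5)  # → 0.03644
```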

Special cases

SM < 4

Select features.

# Define p-value threshold
pval_threshold <- 0.2

# Get the predictors of interest
predictors <-
  univariable_adjusted_recoded %>%
  bind_rows(univariable_unadjusted_recoded) %>%
  filter(`P-values` < pval_threshold) %>%
  pull(`Predictors`)

# Clean categorical variables
predictors <-
  predictors %>%
  str_replace("^([a-z0-9_]*)([A-Z].*)$", "\\1") %>%
  unique()

# Add adjustment variables (none at present; step kept for pipeline consistency)
predictors <- unique(c(predictors))

# Remove unwanted predictors
predictors <- predictors[!predictors %in% unwanted_with_gradings]

# Replace SM by its binary variant
predictors <- c(predictors, "is_spetzler_martin_grade_less_than_4")
predictors <- predictors[!predictors == "spetzler_martin_grade"]

# Print
print(predictors)
[1] "modified_rankin_score_pretreatment"   "is_diffuse_nidus"                    
[3] "location"                             "procedure_combinations"              
[5] "has_seizures"                         "age_at_first_treatment_yrs"          
[7] "is_spetzler_martin_grade_less_than_4"

Fit logistic.

# Define data
df <- df_multi %>% drop_na(modified_rankin_score_pretreatment, location)

# Create formula
model <- as.formula(paste(
  CHOSEN_OUTCOME,
  "~",
  paste(predictors, collapse = " + ")
))

fit <- glm(
  model,
  data = df,
  family = binomial()
)

# Save
fit_without_grading <- fit

Print results.

# Summarize model coefficients
fit_summary <-
  fit %>%
  broom::tidy(exponentiate = T, conf.int = T) %>%
  arrange(term == "(Intercept)", p.value) %>%
  rename(odds_ratio = estimate, z_value = statistic)

# Stylize and print
stylized <-
  fit_summary %>%
  rename(
    "Predictors" = term,
    "Odds Ratios (OR)" = odds_ratio,
    "SE" = std.error,
    "Z-scores" = z_value,
    "P-values" = p.value,
    "CI (low)" = conf.low,
    "CI (high)" = conf.high,
  ) %>%
  mutate(
    `Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
    `SE` = round(`SE`, 2),
    `CI (low)` = round(`CI (low)`, 2),
    `CI (high)` = round(`CI (high)`, 2),
    `P-values` = round(`P-values`, 3),
  )

# Print
stylized %>% sable()
Predictors Odds Ratios (OR) SE Z-scores P-values CI (low) CI (high)
modified_rankin_score_pretreatment 1.84 0.23 2.62903 0.009 1.20 3.05
age_at_first_treatment_yrs 0.79 0.10 -2.33246 0.020 0.64 0.95
has_seizuresTRUE 0.06 1.45 -1.94563 0.052 0.00 0.64
locationDeep 7.57 1.14 1.78088 0.075 0.96 90.68
is_spetzler_martin_grade_less_than_4TRUE 0.25 0.84 -1.62524 0.104 0.05 1.33
procedure_combinationsMultimodal 0.12 1.51 -1.41032 0.158 0.01 3.60
procedure_combinationsR 0.09 1.77 -1.33410 0.182 0.00 4.02
procedure_combinationsS 0.23 1.52 -0.96180 0.336 0.01 7.10
is_diffuse_nidusTRUE 1.95 0.91 0.73068 0.465 0.30 11.74
locationHemispheric 1.11 0.97 0.10903 0.913 0.18 8.99
(Intercept) 2.60 1.90 0.50352 0.615 0.05 103.54

Create robust CIs using the sandwich estimator.

# Calculate robust standard errors
robust_se <- sandwich::vcovHC(fit, type = "HC0")

# Compute robust confidence intervals
robust_ci <- lmtest::coeftest(fit, vcov. = robust_se)

Stylize results.

# Summarize model coefficients
fit_summary <-
  robust_ci %>%
  # broom does not support exponentiation after coeftest so do it manually
  broom::tidy(exponentiate = F, conf.int = T) %>%
  mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
  arrange(term == "(Intercept)", p.value) %>%
  rename(odds_ratio = estimate, z_value = statistic)

# Stylize and print
multivariable_pvalue <-
  fit_summary %>%
  rename(
    "Predictors" = term,
    "Odds Ratios (OR)" = odds_ratio,
    "SE" = std.error,
    "Z-scores" = z_value,
    "P-values" = p.value,
    "CI (low)" = conf.low,
    "CI (high)" = conf.high,
  ) %>%
  mutate(
    `Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
    `SE` = round(`SE`, 2),
    `CI (low)` = round(`CI (low)`, 2),
    `CI (high)` = round(`CI (high)`, 2),
    `P-values` = round(`P-values`, 3),
  )

# Print
multivariable_pvalue %>% sable()
Predictors Odds Ratios (OR) SE Z-scores P-values CI (low) CI (high)
modified_rankin_score_pretreatment 1.84 0.20 3.10006 0.002 1.25 2.71
has_seizuresTRUE 0.06 0.99 -2.85285 0.004 0.01 0.41
age_at_first_treatment_yrs 0.79 0.11 -2.10963 0.035 0.63 0.98
locationDeep 7.57 1.11 1.82550 0.068 0.86 66.49
is_spetzler_martin_grade_less_than_4TRUE 0.25 0.88 -1.55278 0.120 0.05 1.43
procedure_combinationsMultimodal 0.12 1.70 -1.25886 0.208 0.00 3.28
procedure_combinationsR 0.09 2.15 -1.10135 0.271 0.00 6.31
procedure_combinationsS 0.23 1.55 -0.93935 0.348 0.01 4.88
is_diffuse_nidusTRUE 1.95 0.84 0.79299 0.428 0.37 10.13
locationHemispheric 1.11 0.86 0.12300 0.902 0.21 6.03
(Intercept) 2.60 2.11 0.45357 0.650 0.04 163.13

Plot the coefficients with their confidence intervals.

robust_ci %>%
  # broom does not support exponentiation after coeftest so do it manually
  broom::tidy(exponentiate = F, conf.int = T) %>%
  mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
  filter(term != "(Intercept)") %>%
  arrange(desc(estimate)) %>%
  mutate(term = factor(term, levels = term)) %>%
  ggplot(aes(x = estimate, y = term, color = term)) +
  geom_pointrange(aes(xmin = conf.low, xmax = conf.high)) +
  geom_vline(xintercept = 1, linetype = 2, alpha = 0.5) +
  labs(x = "Odds ratio", y = NULL) +
  guides(color = "none") +
  theme(
    axis.text = element_text(size = 12),
    axis.title = element_text(size = 13)
  )

# Dotwhisker was not giving me the correct plot
# dotwhisker::dwplot(
#   vline = geom_vline(xintercept = 1, linetype = 2, alpha =  0.5),
#   show_intercept = F
# )

Had surgery

Select features.

# Define p-value threshold
pval_threshold <- 0.2

# Get the predictors of interest
predictors <-
  univariable_adjusted_recoded %>%
  bind_rows(univariable_unadjusted_recoded) %>%
  filter(`P-values` < pval_threshold) %>%
  pull(`Predictors`)

# Clean categorical variables
predictors <-
  predictors %>%
  str_replace("^([a-z0-9_]*)([A-Z].*)$", "\\1") %>%
  unique()

# Add adjustment variables (none at present; step kept for pipeline consistency)
predictors <- unique(c(predictors))

# Remove unwanted predictors
predictors <- predictors[!predictors %in% unwanted_with_gradings]

# Replace procedure combinations by a binary surgery indicator
predictors <- c(predictors, "is_surgery")
predictors <- predictors[!predictors == "procedure_combinations"]

# Print
print(predictors)
[1] "modified_rankin_score_pretreatment" "spetzler_martin_grade"             
[3] "is_diffuse_nidus"                   "location"                          
[5] "has_seizures"                       "age_at_first_treatment_yrs"        
[7] "is_surgery"                        

Fit logistic.

# Define data
df <- df_multi %>% drop_na(modified_rankin_score_pretreatment, location)

# Create formula
model <- as.formula(paste(
  CHOSEN_OUTCOME,
  "~",
  paste(predictors, collapse = " + ")
))

fit <- glm(
  model,
  data = df,
  family = binomial()
)

# Save
fit_without_grading <- fit

Print results.

# Summarize model coefficients
fit_summary <-
  fit %>%
  broom::tidy(exponentiate = T, conf.int = T) %>%
  arrange(term == "(Intercept)", p.value) %>%
  rename(odds_ratio = estimate, z_value = statistic)

# Stylize and print
stylized <-
  fit_summary %>%
  rename(
    "Predictors" = term,
    "Odds Ratios (OR)" = odds_ratio,
    "SE" = std.error,
    "Z-scores" = z_value,
    "P-values" = p.value,
    "CI (low)" = conf.low,
    "CI (high)" = conf.high,
  ) %>%
  mutate(
    `Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
    `SE` = round(`SE`, 2),
    `CI (low)` = round(`CI (low)`, 2),
    `CI (high)` = round(`CI (high)`, 2),
    `P-values` = round(`P-values`, 3),
  )

# Print
stylized %>% sable()
Predictors Odds Ratios (OR) SE Z-scores P-values CI (low) CI (high)
modified_rankin_score_pretreatment 1.87 0.22 2.84382 0.004 1.25 3.01
age_at_first_treatment_yrs 0.79 0.10 -2.40677 0.016 0.64 0.95
has_seizuresTRUE 0.07 1.39 -1.89267 0.058 0.00 0.70
spetzler_martin_grade 1.84 0.40 1.52779 0.127 0.84 4.15
locationDeep 4.20 1.08 1.32925 0.184 0.57 43.25
is_surgeryTRUE 1.84 0.89 0.68341 0.494 0.31 11.28
is_diffuse_nidusTRUE 1.53 0.94 0.45135 0.652 0.22 9.98
locationHemispheric 0.89 0.92 -0.12813 0.898 0.16 6.30
(Intercept) 0.03 1.74 -1.98131 0.048 0.00 0.72

Create robust CIs using the sandwich estimator.

# Calculate robust standard errors
robust_se <- sandwich::vcovHC(fit, type = "HC0")

# Compute robust confidence intervals
robust_ci <- lmtest::coeftest(fit, vcov. = robust_se)

Stylize results.

# Summarize model coefficients
fit_summary <-
  robust_ci %>%
  # broom does not support exponentiation after coeftest so do it manually
  broom::tidy(exponentiate = F, conf.int = T) %>%
  mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
  arrange(term == "(Intercept)", p.value) %>%
  rename(odds_ratio = estimate, z_value = statistic)

# Stylize and print
multivariable_pvalue <-
  fit_summary %>%
  rename(
    "Predictors" = term,
    "Odds Ratios (OR)" = odds_ratio,
    "SE" = std.error,
    "Z-scores" = z_value,
    "P-values" = p.value,
    "CI (low)" = conf.low,
    "CI (high)" = conf.high,
  ) %>%
  mutate(
    `Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
    `SE` = round(`SE`, 2),
    `CI (low)` = round(`CI (low)`, 2),
    `CI (high)` = round(`CI (high)`, 2),
    `P-values` = round(`P-values`, 3),
  )

# Print
multivariable_pvalue %>% sable()
| Predictors | Odds Ratios (OR) | SE | Z-scores | P-values | CI (low) | CI (high) |
|:---|---:|---:|---:|---:|---:|---:|
| modified_rankin_score_pretreatment | 1.87 | 0.18 | 3.42492 | 0.001 | 1.31 | 2.68 |
| has_seizuresTRUE | 0.07 | 1.03 | -2.54071 | 0.011 | 0.01 | 0.55 |
| age_at_first_treatment_yrs | 0.79 | 0.10 | -2.41164 | 0.016 | 0.65 | 0.96 |
| locationDeep | 4.20 | 0.92 | 1.55658 | 0.120 | 0.69 | 25.63 |
| spetzler_martin_grade | 1.84 | 0.47 | 1.30683 | 0.191 | 0.74 | 4.61 |
| is_surgeryTRUE | 1.84 | 0.93 | 0.65228 | 0.514 | 0.29 | 11.46 |
| is_diffuse_nidusTRUE | 1.53 | 0.89 | 0.47657 | 0.634 | 0.27 | 8.85 |
| locationHemispheric | 0.89 | 0.80 | -0.14657 | 0.883 | 0.18 | 4.28 |
| (Intercept) | 0.03 | 2.39 | -1.44828 | 0.148 | 0.00 | 3.39 |

Plot the coefficients with their confidence intervals.

robust_ci %>%
  # broom does not support exponentiation after coeftest so do it manually
  broom::tidy(exponentiate = FALSE, conf.int = TRUE) %>%
  mutate(across(c(estimate, conf.low, conf.high), exp)) %>%
  filter(term != "(Intercept)") %>%
  arrange(desc(estimate)) %>%
  mutate(term = factor(term, levels = term)) %>%
  ggplot(aes(x = estimate, y = term, color = term)) +
  geom_pointrange(aes(xmin = conf.low, xmax = conf.high)) +
  geom_vline(xintercept = 1, linetype = 2, alpha = 0.5) +
  labs(x = "Odds ratio", y = NULL) +
  guides(color = "none") +
  theme(
    axis.text = element_text(size = 12),
    axis.title = element_text(size = 13)
  )

# Dotwhisker was not giving me the correct plot
# dotwhisker::dwplot(
#   vline = geom_vline(xintercept = 1, linetype = 2, alpha =  0.5),
#   show_intercept = F
# )

Grade 5s

df_uni %>%
  drop_na(spetzler_martin_grade) %>%
  count(is_poor_outcome, spetzler_martin_grade, name = "num_patients") %>%
  mutate(
    pct_patients = scales::percent(num_patients / sum(num_patients), 1)
  ) %>%
  sable()
| is_poor_outcome | spetzler_martin_grade | num_patients | pct_patients |
|:---|---:|---:|---:|
| FALSE | 1 | 26 | 19% |
| FALSE | 2 | 30 | 22% |
| FALSE | 3 | 42 | 31% |
| FALSE | 4 | 17 | 12% |
| FALSE | 5 | 7 | 5% |
| TRUE | 1 | 2 | 1% |
| TRUE | 2 | 2 | 1% |
| TRUE | 3 | 4 | 3% |
| TRUE | 4 | 4 | 3% |
| TRUE | 5 | 3 | 2% |

Only consider those with SM grade 5.

df_uni %>%
  filter(spetzler_martin_grade == 5) %>%
  count(is_poor_outcome, spetzler_martin_grade, name = "num_patients") %>%
  mutate(
    pct_patients = scales::percent(num_patients / sum(num_patients), 1)
  ) %>%
  sable()
| is_poor_outcome | spetzler_martin_grade | num_patients | pct_patients |
|:---|---:|---:|---:|
| FALSE | 5 | 7 | 70% |
| TRUE | 5 | 3 | 30% |
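With only 10 grade-5 patients, that 30% poor-outcome rate is very imprecise. As a quick sanity check (an aside, not part of the modeling above), base R's `binom.test()` gives an exact Clopper-Pearson interval:

```r
# Exact (Clopper-Pearson) 95% CI for 3 poor outcomes among 10 grade-5 patients
ci <- binom.test(x = 3, n = 10)$conf.int
round(ci, 2)  # roughly 0.07 to 0.65
```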

Write

Setup

Create the necessary directories.

# Get today's date
today <- format(Sys.Date(), "%Y-%m-%d")

# Create folder names
base_folder <- file.path(DST_DIRNAME, ANALYSIS_NAME)
date_folder <- file.path(base_folder, today)
csv_folder <- file.path(date_folder, "csv")
pdf_folder <- file.path(date_folder, "pdf")
html_folder <- file.path(date_folder, "html")

# Create folders
dir.create(base_folder, showWarnings = FALSE)
dir.create(date_folder, showWarnings = FALSE)
dir.create(csv_folder, showWarnings = FALSE)
dir.create(pdf_folder, showWarnings = FALSE)
dir.create(html_folder, showWarnings = FALSE)

# Print folder names
print(base_folder)
print(date_folder)
print(csv_folder)
print(pdf_folder)
print(html_folder)
[1] "../../outputs//predictive-analytics/poor-outcomes_has-hemorrhageTRUE"
[1] "../../outputs//predictive-analytics/poor-outcomes_has-hemorrhageTRUE/2024-07-30"
[1] "../../outputs//predictive-analytics/poor-outcomes_has-hemorrhageTRUE/2024-07-30/csv"
[1] "../../outputs//predictive-analytics/poor-outcomes_has-hemorrhageTRUE/2024-07-30/pdf"
[1] "../../outputs//predictive-analytics/poor-outcomes_has-hemorrhageTRUE/2024-07-30/html"
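The per-folder dir.create() calls can also be collapsed: with recursive = TRUE, dir.create() builds any missing parent directories in one pass. A side-effect-free sketch under tempdir() (the folder names are placeholders, not the analysis paths above):

```r
# Build <analysis>/<date>/{csv,pdf,html} in one loop; recursive = TRUE
# creates missing parents, showWarnings = FALSE silences "already exists"
demo_date_folder <- file.path(tempdir(), "demo-analysis", format(Sys.Date(), "%Y-%m-%d"))
demo_folders <- file.path(demo_date_folder, c("csv", "pdf", "html"))
for (f in demo_folders) dir.create(f, recursive = TRUE, showWarnings = FALSE)
```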

Figures

Write all figures.

# Save
# ggsave(
#   file.path(pdf_folder, "histogram_num_days.pdf"),
#   plot = plots$histogram_num_days,
#   width = 6,
#   height = 6
# )
# # Start graphic device
# pdf(
#   file = file.path(pdf_folder, "forest-plot_frequentist.pdf"),
#   width = 10,
#   height = 15
# )
#
# # Plot
# plots$forest_plot_frequentist
#
# # Shut down device
# dev.off()

Tables

Write all tables.

# # Arguments
# df <- tables$desc_stats_cohorts_cv_prolaio
# filepath_csv <- file.path(csv_folder, "desc-stats_by-cohort_cov.csv")
# filepath_html <- file.path(html_folder, "desc-stats_by-cohort_cov.html")
#
# # Save as CSV
# write_csv(
#   x = df,
#   file = filepath_csv
# )
#
# # Save as HTML
# # - Save pdf does not work with webshot
# # - It works with pagedown but not as pretty as desired
# df %>%
#   sable() %>%
#   kableExtra::save_kable(file = filepath_html)
write_csv(
  univariable_unadjusted,
  file.path(csv_folder, "univariable_unadjusted.csv")
)
write_csv(
  univariable_adjusted,
  file.path(csv_folder, "univariable_adjusted.csv")
)
write_csv(
  multivariable_pvalue,
  file.path(csv_folder, "multivariable_pvalue.csv")
)
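As the number of exported tables grows, the repeated write_csv() calls can be replaced by a loop over a named list, deriving each filename from the list name. A sketch with placeholder data frames standing in for univariable_unadjusted and friends (writing to tempdir() so it is side-effect free):

```r
library(readr)

# Placeholder tables; in the real script these would be the objects above
tables <- list(
  univariable_unadjusted = data.frame(term = "x", p = 0.05),
  multivariable_pvalue   = data.frame(term = "y", p = 0.01)
)

# Write each table to <csv_folder>/<name>.csv
demo_csv_folder <- tempdir()
for (name in names(tables)) {
  write_csv(tables[[name]], file.path(demo_csv_folder, paste0(name, ".csv")))
}
```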

Data

Write all data.

# # Arguments
# df <- study
# filename <- paste0(ANALYSIS_NAME, "_", today, ".csv")
# filepath_csv <- file.path(DST_BUCKET, dst_dirname_data, filename)
#
# # Save as CSV
# write_csv(
#   x = df,
#   file = filepath_csv
# )

Reproducibility

Linting and styling

# Style current file
styler::style_file(
  path = rstudioapi::getSourceEditorContext()$path,
  style = styler::tidyverse_style
)

# Lint current file
lintr::lint(rstudioapi::getSourceEditorContext()$path)

Dependency management

# Clean up project of libraries not in use
# (use prompt = FALSE to avoid the interactive session)
# (packages can only be removed in interactive mode b/c this is destructive)
renv::clean(prompt = TRUE)

# Update lock file with new packages
renv::snapshot()

Containerization

# Only run this if option is set to TRUE
if (UPDATE_DOCKERFILE) {
  # Create a dockerfile from the session info
  my_dockerfile <- containerit::dockerfile(from = sessionInfo(), env = ls())
  # Write file
  write(my_dockerfile, file = "~/Dockerfile")
  # Print
  print(my_dockerfile)
}

Documentation

Session Info

R version 4.3.2 (2023-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.5

Matrix products: default
BLAS:
/System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;
LAPACK version 3.11.0

attached base packages:
[1] parallel stats graphics grDevices datasets utils methods base

other attached packages:
[1] rmarkdown_2.25 table1_1.4.3 shiny_1.8.0
[4] lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1
[7] dplyr_1.1.4 purrr_1.0.2 readr_2.1.5
[10] tidyr_1.3.1 tibble_3.2.1 tidyverse_2.0.0
[13] selectiveInference_1.2.5 MASS_7.3-60 adaptMCMC_1.5
[16] coda_0.19-4.1 survival_3.5-7 intervals_0.15.4
[19] glmnet_4.1-8 Matrix_1.6-1.1 magrittr_2.0.3
[22] ggplot2_3.5.0 kableExtra_1.4.0 rmdformats_1.0.4
[25] knitr_1.45

loaded via a namespace (and not attached):
[1] splines_4.3.2 later_1.3.2 versions_0.3
[4] R.oo_1.26.0 hardhat_1.4.0 pROC_1.18.5
[7] rpart_4.1.21 rex_1.2.1 lifecycle_1.0.4
[10] tcltk_4.3.2 globals_0.16.3 processx_3.8.3
[13] lattice_0.21-9 vroom_1.6.5 backports_1.4.1
[16] sass_0.4.8 jquerylib_0.1.4 yaml_2.3.8
[19] remotes_2.4.2.1 httpuv_1.6.14 summarytools_1.0.1
[22] pkgload_1.3.4 R.cache_0.16.0 R.utils_2.12.3
[25] styler_1.10.2 nnet_7.3-19 sandwich_3.1-0
[28] ipred_0.9-14 lava_1.8.0 listenv_0.9.1
[31] parallelly_1.37.1 svglite_2.1.3 codetools_0.2-19
[34] xml2_1.3.6 tidyselect_1.2.1 precrec_0.14.4
[37] shape_1.4.6.1 futile.logger_1.4.3 farver_2.1.1
[40] matrixStats_1.3.0 stats4_4.3.2 base64enc_0.1-3
[43] jsonlite_1.8.8 caret_6.0-94 e1071_1.7-14
[46] ellipsis_0.3.2 Formula_1.2-5 iterators_1.0.14
[49] systemfonts_1.0.5 foreach_1.5.2 tools_4.3.2
[52] pryr_0.1.6 Rcpp_1.0.12 glue_1.7.0
[55] gridExtra_2.3 prodlim_2023.08.28 xfun_0.42
[58] withr_3.0.0 formatR_1.14 BiocManager_1.30.22
[61] fastmap_1.1.1 fansi_1.0.6 callr_3.7.5
[64] digest_0.6.34 lintr_3.1.1 timechange_0.3.0
[67] R6_2.5.1 mime_0.12 colorspace_2.1-0
[70] R.methodsS3_1.8.2 utf8_1.2.4 generics_0.1.3
[73] renv_1.0.4 data.table_1.15.4 recipes_1.0.10
[76] class_7.3-22 stevedore_0.9.6 ModelMetrics_1.2.2.2
[79] pkgconfig_2.0.3 gtable_0.3.4 timeDate_4032.109
[82] lmtest_0.9-40 containerit_0.6.0.9004 htmltools_0.5.7
[85] bookdown_0.39 scales_1.3.0 cyclocomp_1.1.1
[88] gower_1.0.1 lambda.r_1.2.4 rstudioapi_0.15.0
[91] tzdb_0.4.0 reshape2_1.4.4 checkmate_2.3.1
[94] nlme_3.1-163 curl_5.2.1 ggcorrplot_0.1.4.1
[97] proxy_0.4-27 zoo_1.8-12 cachem_1.0.8
[100] miniUI_0.1.1.1 desc_1.4.3 pillar_1.9.0
[103] grid_4.3.2 vctrs_0.6.5 pscl_1.5.9
[106] promises_1.2.1 shinyFiles_0.9.3 xtable_1.8-4
[109] evaluate_0.23 cvAUC_1.1.4 magick_2.8.3
[112] cli_3.6.2 compiler_4.3.2 futile.options_1.0.1
[115] rlang_1.1.3 crayon_1.5.2 future.apply_1.11.2
[118] labeling_0.4.3 ps_1.7.6 plyr_1.8.9
[121] fs_1.6.3 stringi_1.8.3 pander_0.6.5
[124] viridisLite_0.4.2 assertthat_0.2.1 munsell_0.5.0
[127] lazyeval_0.2.2 rapportools_1.1 hms_1.1.3
[130] bit64_4.0.5 future_1.33.2 highr_0.10
[133] ROCR_1.0-11 semver_0.2.0 broom_1.0.6
[136] memoise_2.0.1 bslib_0.6.1 bit_4.0.5
